Internet Advertising Background
DSP Architecture Diagram
DSP: an agency acting on behalf of advertisers, helping them place ads. It is also a web platform that stores each advertiser's requirements (target-audience profiles).
DMP: stores user profiles.
Flow Walkthrough
1. When a user opens an app, the app sends a request to the Ad Exchange; the request carries user information (a userId);
2. One Ad Exchange works with many DSP platforms; after receiving the app's request, it forwards the user information to each DSP;
3. An ad-serving engine sits in front of each DSP and matches the incoming user information against the targeting data in the DMP; on a successful match, that DSP joins the auction with a bid.
4. The Ad Exchange picks the winning bid and serves the corresponding ad back to the app; the whole round trip takes roughly 200 ms.
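The four steps above can be sketched in a few lines of Scala. All names here (UserInfo, Dsp, runAuction) and the matching rule are hypothetical illustrations, not a real bidding protocol (in practice an exchange would follow a spec such as OpenRTB):

```scala
// Hypothetical request/response shapes, for illustration only.
case class UserInfo(userId: String, tags: Set[String])
case class Dsp(name: String, targeting: Set[String], price: Double)

object AuctionSketch {
  // Steps 2-3: the exchange fans the user out to every DSP; a DSP bids only
  // when the user's profile (from the DMP) overlaps its advertiser's targeting.
  def collectBids(user: UserInfo, dsps: Seq[Dsp]): Seq[(String, Double)] =
    dsps.collect {
      case d if d.targeting.intersect(user.tags).nonEmpty => (d.name, d.price)
    }

  // Step 4: the exchange picks the highest bid and serves that DSP's ad.
  def runAuction(user: UserInfo, dsps: Seq[Dsp]): Option[String] = {
    val bids = collectBids(user, dsps)
    if (bids.isEmpty) None else Some(bids.maxBy(_._2)._1)
  }
}
```

The whole fan-out, match, and pick-the-winner cycle is what has to fit inside the ~200 ms budget.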
DMP
Building user profiles in the DMP depends on large volumes of log data.
The ad-serving engine exchanges a large amount of data with the Ad Exchange and persists the user data it receives.
Project Flow Diagram
Converting log files to Parquet
The straightforward approach
package cn.dmp.tools
import cn.dmp.utils.{NBF, SchemaUtils}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
/**
* Convert raw log files to Parquet format,
* using snappy compression.
*/
object Bzip2Parquet {
def main(args: Array[String]): Unit = {
// 0. Validate the argument count
if (args.length != 3) {
println(
"""
|cn.dmp.tools.Bzip2Parquet
|arguments:
| logInputPath
| compressionCode <snappy, gzip, lzo>
| resultOutputPath
""".stripMargin)
sys.exit()
}
// 1. Read the program arguments
val Array(logInputPath, compressionCode, resultOutputPath) = args
// 2. Create SparkConf -> SparkContext
val sparkConf = new SparkConf()
sparkConf.setAppName(s"${this.getClass.getSimpleName}")
sparkConf.setMaster("local[*]")
// Use Kryo for RDD serialization (spill to disk and worker-to-worker shuffle transfer)
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(sparkConf)
val sQLContext = new SQLContext(sc)
sQLContext.setConf("spark.sql.parquet.compression.codec", compressionCode)
// 3. Read the log data
val rawdata = sc.textFile(logInputPath)
// 4. ETL the data per business requirements; format: xxxx,x,x,x,x,,,,, (comma-separated, empty fields allowed)
val dataRow: RDD[Row] = rawdata
.map(line => line.split(",", -1)) // -1 keeps trailing empty fields
.filter(_.length >= 85)
.map(arr => {
Row(
arr(0),
NBF.toInt(arr(1)),
NBF.toInt(arr(2)),
NBF.toInt(arr(3)),
NBF.toInt(arr(4)),
arr(5),
arr(6),
NBF.toInt(arr(7)),
NBF.toInt(arr(8)),
NBF.toDouble(arr(9)),
NBF.toDouble(arr(10)),
arr(11),
arr(12),
arr(13),
arr(14),
arr(15),
arr(16),
NBF.toInt(arr(17)),
arr(18),
arr(19),
NBF.toInt(arr(20)),
NBF.toInt(arr(21)),
arr(22),
arr(23),
arr(24),
arr(25),
NBF.toInt(arr(26)),
arr(27),
NBF.toInt(arr(28)),
arr(29),
NBF.toInt(arr(30)),
NBF.toInt(arr(31)),
NBF.toInt(arr(32)),
arr(33),
NBF.toInt(arr(34)),
NBF.toInt(arr(35)),
NBF.toInt(arr(36)),
arr(37),
NBF.toInt(arr(38)),
NBF.toInt(arr(39)),
NBF.toDouble(arr(40)),
NBF.toDouble(arr(41)),
NBF.toInt(arr(42)),
arr(43),
NBF.toDouble(arr(44)),
NBF.toDouble(arr(45)),
arr(46),
arr(47),
arr(48),
arr(49),
arr(50),
arr(51),
arr(52),
arr(53),
arr(54),
arr(55),
arr(56),
NBF.toInt(arr(57)),
NBF.toDouble(arr(58)),
NBF.toInt(arr(59)),
NBF.toInt(arr(60)),
arr(61),
arr(62),
arr(63),
arr(64),
arr(65),
arr(66),
arr(67),
arr(68),
arr(69),
arr(70),
arr(71),
arr(72),
NBF.toInt(arr(73)),
NBF.toDouble(arr(74)),
NBF.toDouble(arr(75)),
NBF.toDouble(arr(76)),
NBF.toDouble(arr(77)),
NBF.toDouble(arr(78)),
arr(79),
arr(80),
arr(81),
arr(82),
arr(83),
NBF.toInt(arr(84))
)
})
// 5. Build a DataFrame with the schema and write the result out as Parquet
val dataFrame = sQLContext.createDataFrame(dataRow, SchemaUtils.logStructType)
dataFrame.write.parquet(resultOutputPath)
// 6. Stop the SparkContext
sc.stop()
}
}
package cn.dmp.utils
import org.apache.spark.sql.types._
object SchemaUtils {
/**
* Schema definition for the log records.
*/
val logStructType = StructType(Seq(
StructField("sessionid", StringType),
StructField("advertisersid", IntegerType),
StructField("adorderid", IntegerType),
StructField("adcreativeid", IntegerType),
StructField("adplatformproviderid", IntegerType),
StructField("sdkversion", StringType),
StructField("adplatformkey", StringType),
StructField("putinmodeltype", IntegerType),
StructField("requestmode", IntegerType),
StructField("adprice", DoubleType),
StructField("adppprice", DoubleType),
StructField("requestdate", StringType),
StructField("ip", StringType),
StructField("appid", StringType),
StructField("appname", StringType),
StructField("uuid", StringType),
StructField("device", StringType),
StructField("client", IntegerType),
StructField("osversion", StringType),
StructField("density", StringType),
StructField("pw", IntegerType),
StructField("ph", IntegerType),
StructField("long", StringType),
StructField("lat", StringType),
StructField("provincename", StringType),
StructField("cityname", StringType),
StructField("ispid", IntegerType),
StructField("ispname", StringType),
StructField("networkmannerid", IntegerType),
StructField("networkmannername", StringType),
StructField("iseffective", IntegerType),
StructField("isbilling", IntegerType),
StructField("adspacetype", IntegerType),
StructField("adspacetypename", StringType),
StructField("devicetype", IntegerType),
StructField("processnode", IntegerType),
StructField("apptype", IntegerType),
StructField("district", StringType),
StructField("paymode", IntegerType),
StructField("isbid", IntegerType),
StructField("bidprice", DoubleType),
StructField("winprice", DoubleType),
StructField("iswin", IntegerType),
StructField("cur", StringType),
StructField("rate", DoubleType),
StructField("cnywinprice", DoubleType),
StructField("imei", StringType),
StructField("mac", StringType),
StructField("idfa", StringType),
StructField("openudid", StringType),
StructField("androidid", StringType),
StructField("rtbprovince", StringType),
StructField("rtbcity", StringType),
StructField("rtbdistrict", StringType),
StructField("rtbstreet", StringType),
StructField("storeurl", StringType),
StructField("realip", StringType),
StructField("isqualityapp", IntegerType),
StructField("bidfloor", DoubleType),
StructField("aw", IntegerType),
StructField("ah", IntegerType),
StructField("imeimd5", StringType),
StructField("macmd5", StringType),
StructField("idfamd5", StringType),
StructField("openudidmd5", StringType),
StructField("androididmd5", StringType),
StructField("imeisha1", StringType),
StructField("macsha1", StringType),
StructField("idfasha1", StringType),
StructField("openudidsha1", StringType),
StructField("androididsha1", StringType),
StructField("uuidunknow", StringType),
StructField("userid", StringType),
StructField("iptype", IntegerType),
StructField("initbidprice", DoubleType),
StructField("adpayment", DoubleType),
StructField("agentrate", DoubleType),
StructField("lomarkrate", DoubleType),
StructField("adxrate", DoubleType),
StructField("title", StringType),
StructField("keywords", StringType),
StructField("tagid", StringType),
StructField("callbackdate", StringType),
StructField("channelid", StringType),
StructField("mediatype", IntegerType)
))
}
package cn.dmp.utils
object NBF {
// Parse-or-default helpers: a malformed or empty field becomes 0 instead of
// failing the whole record.
def toInt(str: String): Int = {
try {
str.toInt
} catch {
case _: Exception => 0
}
}
def toDouble(str: String): Double = {
try {
str.toDouble
} catch {
case _: Exception => 0.0
}
}
}
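These helpers trade precision for robustness: a field that fails to parse becomes zero rather than aborting the record. A quick self-contained demonstration (the NBF definition is repeated so the snippet runs on its own):

```scala
object NBF {
  def toInt(str: String): Int =
    try str.toInt catch { case _: Exception => 0 }

  def toDouble(str: String): Double =
    try str.toDouble catch { case _: Exception => 0.0 }
}

// Empty fields (common in the comma-separated logs), junk, and nulls all
// default to zero instead of throwing.
println(NBF.toInt("42"))      // 42
println(NBF.toInt(""))        // 0
println(NBF.toDouble("3.14")) // 3.14
println(NBF.toDouble(null))   // 0.0
```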
Encapsulating the data in a class
package cn.dmp.tools
import cn.dmp.beans.Log
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
/**
* Convert logs to Parquet format,
*
* building the schema from a custom class.
*/
object Biz2ParquetV2 {
def main(args: Array[String]): Unit = {
// 0. Validate the argument count
if (args.length != 3) {
println(
"""
|cn.dmp.tools.Biz2ParquetV2
|arguments:
| logInputPath
| compressionCode <snappy, gzip, lzo>
| resultOutputPath
""".stripMargin)
sys.exit()
}
// 1. Read the program arguments
val Array(logInputPath, compressionCode, resultOutputPath) = args
// 2. Create SparkConf -> SparkContext
val sparkConf = new SparkConf()
sparkConf.setAppName(s"${this.getClass.getSimpleName}")
sparkConf.setMaster("local[*]")
// Use Kryo for RDD serialization (spill to disk and worker-to-worker shuffle transfer)
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Register the custom class with Kryo
sparkConf.registerKryoClasses(Array(classOf[Log]))
val sc = new SparkContext(sparkConf)
val sQLContext = new SQLContext(sc)
sQLContext.setConf("spark.sql.parquet.compression.codec", compressionCode)
// Read the log file
val dataLog: RDD[Log] = sc.textFile(logInputPath)
.map(line => line.split(",", -1))
.filter(_.length >= 85).map(arr => Log(arr))
val dataFrame = sQLContext.createDataFrame(dataLog)
// Partition the output by province name and city name
dataFrame.write.partitionBy("provincename", "cityname").parquet(resultOutputPath)
sc.stop()
}
}
package cn.dmp.beans
import cn.dmp.utils.NBF
class Log(val sessionid: String,
val advertisersid: Int,
val adorderid: Int,
val adcreativeid: Int,
val adplatformproviderid: Int,
val sdkversion: String,
val adplatformkey: String,
val putinmodeltype: Int,
val requestmode: Int,
val adprice: Double,
val adppprice: Double,
val requestdate: String,
val ip: String,
val appid: String,
val appname: String,
val uuid: String,
val device: String,
val client: Int,
val osversion: String,
val density: String,
val pw: Int,
val ph: Int,
val long: String,
val lat: String,
val provincename: String,
val cityname: String,
val ispid: Int,
val ispname: String,
val networkmannerid: Int,
val networkmannername: String,
val iseffective: Int,
val isbilling: Int,
val adspacetype: Int,
val adspacetypename: String,
val devicetype: Int,
val processnode: Int,
val apptype: Int,
val district: String,
val paymode: Int,
val isbid: Int,
val bidprice: Double,
val winprice: Double,
val iswin: Int,
val cur: String,
val rate: Double,
val cnywinprice: Double,
val imei: String,
val mac: String,
val idfa: String,
val openudid: String,
val androidid: String,
val rtbprovince: String,
val rtbcity: String,
val rtbdistrict: String,
val rtbstreet: String,
val storeurl: String,
val realip: String,
val isqualityapp: Int,
val bidfloor: Double,
val aw: Int,
val ah: Int,
val imeimd5: String,
val macmd5: String,
val idfamd5: String,
val openudidmd5: String,
val androididmd5: String,
val imeisha1: String,
val macsha1: String,
val idfasha1: String,
val openudidsha1: String,
val androididsha1: String,
val uuidunknow: String,
val userid: String,
val iptype: Int,
val initbidprice: Double,
val adpayment: Double,
val agentrate: Double,
val lomarkrate: Double,
val adxrate: Double,
val title: String,
val keywords: String,
val tagid: String,
val callbackdate: String,
val channelid: String,
val mediatype: Int) extends Product with Serializable{
// Maps a field index to the corresponding member
override def productElement(n: Int): Any = n match {
case 0 => sessionid
case 1 => advertisersid
case 2 => adorderid
case 3 => adcreativeid
case 4 => adplatformproviderid
case 5 => sdkversion
case 6 => adplatformkey
case 7 => putinmodeltype
case 8 => requestmode
case 9 => adprice
case 10 => adppprice
case 11 => requestdate
case 12 => ip
case 13 => appid
case 14 => appname
case 15 => uuid
case 16 => device
case 17 => client
case 18 => osversion
case 19 => density
case 20 => pw
case 21 => ph
case 22 => long
case 23 => lat
case 24 => provincename
case 25 => cityname
case 26 => ispid
case 27 => ispname
case 28 => networkmannerid
case 29 => networkmannername
case 30 => iseffective
case 31 => isbilling
case 32 => adspacetype
case 33 => adspacetypename
case 34 => devicetype
case 35 => processnode
case 36 => apptype
case 37 => district
case 38 => paymode
case 39 => isbid
case 40 => bidprice
case 41 => winprice
case 42 => iswin
case 43 => cur
case 44 => rate
case 45 => cnywinprice
case 46 => imei
case 47 => mac
case 48 => idfa
case 49 => openudid
case 50 => androidid
case 51 => rtbprovince
case 52 => rtbcity
case 53 => rtbdistrict
case 54 => rtbstreet
case 55 => storeurl
case 56 => realip
case 57 => isqualityapp
case 58 => bidfloor
case 59 => aw
case 60 => ah
case 61 => imeimd5
case 62 => macmd5
case 63 => idfamd5
case 64 => openudidmd5
case 65 => androididmd5
case 66 => imeisha1
case 67 => macsha1
case 68 => idfasha1
case 69 => openudidsha1
case 70 => androididsha1
case 71 => uuidunknow
case 72 => userid
case 73 => iptype
case 74 => initbidprice
case 75 => adpayment
case 76 => agentrate
case 77 => lomarkrate
case 78 => adxrate
case 79 => title
case 80 => keywords
case 81 => tagid
case 82 => callbackdate
case 83 => channelid
case 84 => mediatype
}
// Number of member fields in the object
override def productArity: Int = 85
// Whether `that` is a Log and may be compared with this one (used by equals)
override def canEqual(that: Any): Boolean = that.isInstanceOf[Log]
}
object Log {
def apply(arr: Array[String]): Log = new Log(
arr(0),
NBF.toInt(arr(1)),
NBF.toInt(arr(2)),
NBF.toInt(arr(3)),
NBF.toInt(arr(4)),
arr(5),
arr(6),
NBF.toInt(arr(7)),
NBF.toInt(arr(8)),
NBF.toDouble(arr(9)),
NBF.toDouble(arr(10)),
arr(11),
arr(12),
arr(13),
arr(14),
arr(15),
arr(16),
NBF.toInt(arr(17)),
arr(18),
arr(19),
NBF.toInt(arr(20)),
NBF.toInt(arr(21)),
arr(22),
arr(23),
arr(24),
arr(25),
NBF.toInt(arr(26)),
arr(27),
NBF.toInt(arr(28)),
arr(29),
NBF.toInt(arr(30)),
NBF.toInt(arr(31)),
NBF.toInt(arr(32)),
arr(33),
NBF.toInt(arr(34)),
NBF.toInt(arr(35)),
NBF.toInt(arr(36)),
arr(37),
NBF.toInt(arr(38)),
NBF.toInt(arr(39)),
NBF.toDouble(arr(40)),
NBF.toDouble(arr(41)),
NBF.toInt(arr(42)),
arr(43),
NBF.toDouble(arr(44)),
NBF.toDouble(arr(45)),
arr(46),
arr(47),
arr(48),
arr(49),
arr(50),
arr(51),
arr(52),
arr(53),
arr(54),
arr(55),
arr(56),
NBF.toInt(arr(57)),
NBF.toDouble(arr(58)),
NBF.toInt(arr(59)),
NBF.toInt(arr(60)),
arr(61),
arr(62),
arr(63),
arr(64),
arr(65),
arr(66),
arr(67),
arr(68),
arr(69),
arr(70),
arr(71),
arr(72),
NBF.toInt(arr(73)),
NBF.toDouble(arr(74)),
NBF.toDouble(arr(75)),
NBF.toDouble(arr(76)),
NBF.toDouble(arr(77)),
NBF.toDouble(arr(78)),
arr(79),
arr(80),
arr(81),
arr(82),
arr(83),
NBF.toInt(arr(84))
)
}
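A side note on why Log implements Product by hand instead of being a case class: before Scala 2.11, case classes were limited to 22 fields, and this log record has 85. Spark derives the DataFrame schema by reflecting over the constructor and the Product interface (productArity, productElement). The same pattern in miniature, using a hypothetical two-field class:

```scala
// Miniature of the Log pattern: a plain class wired up as a Product by hand.
class Point(val x: Int, val y: Int) extends Product with Serializable {
  // Index -> member mapping, exactly like Log.productElement
  override def productElement(n: Int): Any = n match {
    case 0 => x
    case 1 => y
  }
  // Number of member fields
  override def productArity: Int = 2
  // Type check used by equals
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Point]
}

val p = new Point(3, 4)
println(p.productArity)      // 2
println(p.productElement(1)) // 4
```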
Performance Tuning
Do you have experience with Spark JVM performance tuning?
Does a job that contains a shuffle necessarily run slowly?
nohup spark-submit --class com.initialize.dmp.log.Biz2Parquet --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 8g --executor-cores 1 --num-executors 20 /dmp-1.0.jar /adlogs/biz2/* /parquet_20_3 &
nohup spark-submit --class com.initialize.dmp.log.Biz2Parquet --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 8g --executor-cores 5 --num-executors 20 /home/hdfs/dmp-1.0.jar /adlogs/biz2/* /parquet_20_7 300 &
Increasing job parallelism
executor-memory: tied to num-executors; their product must not exceed the cluster's total memory capacity.
Note: when computing that product, add roughly 1 GB per executor on top of executor-memory to account for overhead.
executor-cores:
executor-cores * num-executors <= total number of cores in the cluster
If an executor is given only one core, only one task thread can run in that executor at any moment, so its tasks run serially.
If an executor is given N cores, tasks in that executor run in parallel, up to N at a time.
num-executors:
The total number of executors requested; ideally the partition count (partitionNum) is a multiple of the executor count.
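The sizing rules above can be condensed into a small helper. The +1 GB per-executor overhead mirrors the note about adding 1g when multiplying executor-memory by num-executors, and the 2x partitions-per-core factor is a common rule of thumb rather than a hard rule; all names here are illustrative:

```scala
case class ClusterSpec(totalCores: Int, totalMemoryGb: Int)

// Check a proposed --num-executors / --executor-cores / --executor-memory
// combination against the cluster's capacity.
def validSizing(spec: ClusterSpec,
                numExecutors: Int,
                coresPerExec: Int,
                memPerExecGb: Int): Boolean = {
  val coresOk  = numExecutors * coresPerExec <= spec.totalCores
  // Add ~1 GB of per-executor overhead on top of executor-memory.
  val memoryOk = numExecutors * (memPerExecGb + 1) <= spec.totalMemoryGb
  coresOk && memoryOk
}

// Rule of thumb: 2-3 partitions per task slot so every core stays busy.
def suggestedPartitions(numExecutors: Int, coresPerExec: Int): Int =
  numExecutors * coresPerExec * 2

println(validSizing(ClusterSpec(100, 200), 20, 5, 8)) // true: 100 cores, 180 GB
println(suggestedPartitions(20, 5))                   // 200
```

The second spark-submit example above (20 executors x 5 cores) would, by this rule of thumb, want its input split into roughly 200 partitions.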