Because of a change in business logic, the ad-serving data is now written into the big-data cluster; the previous pipeline therefore has to be reworked so that the spend of RTB campaigns can be tracked in real time.
Environment versions:
spark: 2.11-2.4.0-cdh6.2.0
kafka: 2.1.0-cdh6.2.0
flume: 1.9.0-cdh6.2.0
- 1. Flume configuration
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#sources
# spooldir source: pick up new log files appearing in a watched directory
a1.sources.r1.type = spooldir
# as of Flume 1.9, sub-directories are watched recursively for newly created files
a1.sources.r1.recursiveDirectorySearch = true
# the directory Flume watches
a1.sources.r1.spoolDir = /mnt/data1/flume_data/bid_bsw
a1.sources.r1.fileHeader = false
a1.sources.r1.ignorePattern = ^(.)*\\.[01]$
a1.sources.r1.includePattern = ^bid.log.[0-9]{12}$
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# the Kafka topic the sink writes to
a1.sinks.k1.kafka.topic = statistic_bid_topic
a1.sinks.k1.kafka.bootstrap.servers = 192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092
a1.sinks.k1.kafka.producer.acks=1
a1.sinks.k1.kafka.flumeBatchSize =20
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
Notes:
a1.sources.r1.spoolDir sets the directory in which the bid logs are watched;
a1.sources.r1.recursiveDirectorySearch makes that directory watched recursively for newly created files;
a1.sinks.k1.kafka.bootstrap.servers sets the destination broker list of the Kafka cluster.
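With the configuration saved to a file (the file name bid_kafka.conf and the conf directory below are assumptions, not from the original setup), the agent can be started the standard Flume way, e.g.:
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/bid_kafka.conf --name a1 -Dflume.root.logger=INFO,console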
- 2. Kafka installation and configuration
On the CDH cluster: Actions -> Add Service -> Kafka.
cdh-master, cdh-slaver1 and cdh-slaver2 are installed as Kafka brokers;
cdh-master also acts as the Kafka MirrorMaker.
Kafka configuration:
zookeeper.chroot: set to /kafka
auto.create.topics.enable: checked
delete.topic.enable: checked
broker.id
log.dirs: /XXX/kafka/data
Note: if the disk holding this path runs out of space, Kafka will fail with an error and shut down!
bootstrap.servers: 192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092 (the servers on which the brokers are installed)
source.bootstrap.servers: 192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092
whitelist: 192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092
oom_heap_dump_dir: /tmp
Creating a topic:
From Kafka's bin directory, run:
./kafka-topics.sh --create --topic statistic_bid_topic --zookeeper cdh-master.lavapm:2181,cdh-slaver1.lavapm:2181,cdh-slaver2.lavapm:2181/kafka --partitions 3 --replication-factor 2
This creates the topic statistic_bid_topic with 3 partitions and a replication factor of 2;
Listing all topics:
./kafka-topics.sh --zookeeper 192.168.1.65:2181/kafka --list
Describing a topic:
The following command lists the partition count and replication factor of statistic_bid_topic, plus the leader and replica assignment of each partition:
./kafka-topics.sh --zookeeper 192.168.1.65:2181/kafka --topic statistic_bid_topic --describe
Listing consumer groups:
Use the --list flag:
./kafka-consumer-groups.sh --bootstrap-server 192.168.1.65:9092 --list
Describing a specific consumer group:
Use the --group and --describe flags:
./kafka-consumer-groups.sh --bootstrap-server 192.168.1.65:9092 --group statistic_bid_topic_Group --describe
Consuming messages, reading everything from the beginning:
./kafka-console-consumer.sh --bootstrap-server 192.168.1.65:9092 --topic statistic_bid_topic --from-beginning
Producing messages:
./kafka-console-producer.sh --broker-list cdh-master.lavapm:9092 --topic statistic_bid_topic
- 3. Spark Streaming development
The job reads the bid data from the Kafka topic in real time, parses each line as a JSON string, extracts the imp, bid and clk records, aggregates and merges them, and finally writes the increments into Redis. Since the statistics are per serving day, keys are stored in the format <yyyyMMdd>::oid::<ad-unit id>; each key holds the detailed counters for one ad unit (the original screenshots of the Redis keys and values are omitted here).
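A stored hash carries the fields cost, pv, bid and clk (these field names come from the hincrBy calls in the code below). Assuming a hypothetical key for ad unit 284 served on 2019-07-08, it could be inspected with:
redis-cli HGETALL 20190708::oid::284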
The implementation uses Scala:
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.HashPartitioner
import scala.util.parsing.json._
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}
object KafkaBidStatistic extends Serializable {
/**
* Parse a (possibly nested) JSON string into an immutable Map.
* JSON.parseFull returns an Option, so a failed parse (None) must be handled;
* otherwise the pattern match would throw a MatchError on malformed input.
*
* @param string_json the raw JSON line
* @return the parsed key/value map, or an empty map if parsing failed
*/
def str_json(string_json: String): collection.immutable.Map[String, Any] = {
JSON.parseFull(string_json) match {
case Some(map: collection.immutable.Map[String, Any] @unchecked) => map
case _ => collection.immutable.Map() // parse failure or unexpected top-level type
}
}
def get_str_json(line: String): String = {
var ret = "";
try{
var first: collection.immutable.Map[String, Any] = str_json(line);
// Based on the shape of imp log lines, pick out one kind of imp record for the statistics
// Example imp log line: {"cip":"223.24.172.199","mid":"687","pid":"","oid":"284","et":{"bid":1557196078,"imp":1557196080},"wd":{},"ip":"35.200.110.13","ov":1,"ex":20191211,"url":"","usc":["IAB18"],"uri":"\/imp?ext_data=b2lkPTI4NCxjaWQ9LHV1aWQ9VURhSlVmQXEybW1acmVpSTZzWVJCZSxtZT1iaWRzd2l0Y2gsZXg9Ymlkc3dpdGNoLHBjPTAsbGlkPSxuYXQ9MCx2cj0wLHV0PTE1NTcxOTYwNzgsZG9tPSxwaWQ9LG9jYz1jaW0sY2NwPTU2MDAwMCxhaWQ9Njg3LHk9MCxhdj0w&ver=1&reqid=bidswitch&price=0.55992","occ":"cim","me":"bidswitch","mobile":{"id":"","os":2},"ua":"BidSwitch\/1.0","id":"UDaJUfAq2mmZreiI6sYRBe","op":"","ccp":1,"ins":{},"sz":"320x50","reuse":1,"wp":0.55992,"is_mobile":1,"nwp":1,"ev":"imp","rg":1840000000,"did":"CDz2mRv7s7fffMuVY3eeuy"}
// && first.contains("uri") is left out: logs from domestic (China) serving do not carry this field!
if (first("ev").toString == "imp" && first.contains("me")) { // first.getOrElse("me" , 1) && first("me").toString=="bidswitch"
val wp = first("wp");
val mid = first("mid").toString;
val oid = first("oid").toString;
val ev = first("ev").toString;
val id = first("id").toString;
ret = oid + "," + mid + "," + wp + ",\"" + ev + "\",\"" + id + "\""; // oid,mid,wp,"imp","id"
}
else if (first("ev").toString == "bid") {
val mid = first("mid").toString;
val oid = first("oid").toString;
val id = first("id").toString;
val ev = first("ev").toString;
ret = oid + "," + mid + ",\"" + ev + "\",\"" + id + "\""; // oid,mid,"bid","id"
}
else if (first("ev").toString == "clk") {
val mid = first("mid").toString;
val oid = first("oid").toString;
val id = first("id").toString;
val ev = first("ev").toString;
ret = oid + "," + mid + ",\"" + ev + "\",\"" + id + "\""; // oid,mid,"clk","id"
}
}
catch {
case ex: Exception => {
ex.printStackTrace() // printed to stderr
System.err.println("get_str_json() failed to parse the line! line = " + line) // printed to stderr
}
}
ret;
}
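// Illustrative example (hypothetical input, not from real traffic): for the line
//   {"ev":"bid","mid":"687","oid":"284","id":"UDaJUfAq2mmZreiI6sYRBe"}
// get_str_json returns the CSV string:
//   284,687,"bid","UDaJUfAq2mmZreiI6sYRBe"
// For an imp line, the win price wp is additionally inserted as the third field.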
def main(args: Array[String]): Unit = {
// Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)
Logger.getLogger("org.apache.kafka.clients.consumer").setLevel(Level.ERROR)
val conf = new SparkConf();
conf.setMaster("yarn").setAppName("statisticBid2Redis")
// micro-batch interval: check for new data every 2 seconds
val ssc = new StreamingContext(conf, Seconds.apply(2))
val ssce = SparkSession.builder().config(conf).getOrCreate
// NOTE: ideally the spark.streaming.* settings below are placed on the SparkConf
// before the StreamingContext is created, as some of them are only read at context start-up
// increase the number of concurrently running jobs
// spark.conf.set("spark.streaming.concurrentJobs", 10)
// raise the retry count for fetching topic partition leaders and their latest offsets (receiver mode only)
//spark.conf.set("spark.streaming.kafka.maxRetries", 50)
// enable backpressure
ssce.conf.set("spark.streaming.backpressure.enabled", true)
// on shutdown, finish processing the in-flight batch before stopping, so a hard kill does not drop half-processed data
ssce.conf.set("spark.streaming.stopGracefullyOnShutdown", true)
// initial ingestion rate under backpressure; only effective in receiver mode, not in direct mode
ssce.conf.set("spark.streaming.backpressure.initialRate", 5000)
// cap on records fetched per partition per second; bounds the size of the first and of every
// subsequent batch, so no single batch is overloaded and the data is processed evenly
ssce.conf.set("spark.streaming.kafka.maxRatePerPartition", 3000)
val sc = ssce.sparkContext;
println("********* spark.default.parallelism *********"+sc.defaultParallelism)
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092", // the Kafka cluster
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "statistic_bid_topic_Group",
"auto.offset.reset" -> "latest", // "earliest" re-reads from the beginning (like --from-beginning); "latest" consumes only new messages; "smallest" is the old consumer's equivalent of earliest
"enable.auto.commit" -> (true: java.lang.Boolean) // true = offsets committed automatically; false = commit offsets manually
)
val topics = Array("statistic_bid_topic") // one or more topics may be subscribed
//Direct (receiver-less) consumption
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.foreachRDD { kafkaRDD =>
if (!kafkaRDD.isEmpty()) {
val now: Date = new Date();
val dateFormat: SimpleDateFormat = new SimpleDateFormat("yyyyMMdd");
val date_str = dateFormat.format(now);
val df: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");//yyyyMMddHHmmss
// val offsetRanges = kafkaRDD.asInstanceOf[HasOffsetRanges].offsetRanges
val lines: RDD[String] = kafkaRDD.map(e => (e.value()))
//parse each JSON line into a CSV record
val rdd = lines.flatMap {
line =>
val parse_str = get_str_json(line);
val arr = parse_str.split("\t");
arr;
}.filter(word => word.nonEmpty).distinct(); // drop empty records and de-duplicate
rdd.cache();
// imp oid,mid,wp,"imp","id"
val rdd_imp = rdd.flatMap(_.split("\t")).filter(_.indexOf("\"imp\"") > 0).map(_.split(","));
val rdd_imp1 = rdd_imp.map(word => (word(0), word(1)) -> (word(2).toDouble / 1000).formatted("%.6f").toDouble); // (oid,mid) -> wp; prices are CPM, so divide by 1000
// aggregate cost
//val result_imp_cost = rdd_imp1.reduceByKey(_ + _).sortBy(_._2, false).map({ case ((oid, mid), wp) => (oid, mid) -> (wp.toDouble.formatted("%.6f").toDouble*1000000).toInt }); bidswitch carries the real price, so there is no *1000 scaling
val result_imp_cost = rdd_imp1.reduceByKey(_ + _).sortBy(_._2, false).map({ case ((oid, mid), wp) => (oid, mid) -> (wp.toDouble.formatted("%.6f").toDouble*1000).toInt });
// aggregate pv
val rdd_imp3 = rdd_imp1.map({ case ((oid, mid), wp) => (oid, mid) -> 1 }); // (oid,mid) -> 1
val result_imp_count = rdd_imp3.reduceByKey(_ + _).sortBy(_._2, false); // impression count
// count bids: oid,mid,"bid","id"
val rdd_bid = rdd.flatMap(_.split("\t")).distinct.filter(_.indexOf("\"bid\"") > 0).map(_.split(","));
val rdd_bid1 = rdd_bid.map(word => (word(0), word(1)) -> 1);
val result_bid_count = rdd_bid1.reduceByKey(_ + _).sortBy(_._2, false);
// count clicks: oid,mid,"clk","id"
val rdd_clk = rdd.flatMap(_.split("\t")).distinct.filter(_.indexOf("\"clk\"") > 0).map(_.split(","));
val rdd_clk1 = rdd_clk.map(word => (word(0), word(1)) -> 1);
val result_clk_count = rdd_clk1.reduceByKey(_ + _).sortBy(_._2, false);
// join imp+bid+clk
val rdd_join1 = result_imp_cost.join(result_imp_count).map({ case ((oid, mid), (wp, pv)) => (oid, mid) -> (wp, pv) }); // ((oid,mid),(wp,pv))
val rdd_join2 = result_bid_count.leftOuterJoin(rdd_join1);
// (oid,mid),(bid,(wp,pv)) (wp,pv)Option[(Double,Int)]
val rdd_join3 = rdd_join2.map { case ((oid, mid), (bid, x)) =>
val (wp, pv) = x.getOrElse((0, 0));
(oid, mid) -> (bid, wp, pv)
}
val rdd_join4 = rdd_join3.fullOuterJoin(result_clk_count);
// (oid,mid),((bid,wp,pv),clk)
val rdd_join5 = rdd_join4.map { case ((oid, mid), (x, y)) =>
val (bid, wp, pv) = x.getOrElse((0, 0, 0));
(oid, mid, bid, wp, pv, y.getOrElse(0))
}
//"{"+oid+","+mid+","+bid+","+pv+","+clk+","+wp+"}" oid,mid,bid,wp,pv,clk .sortBy(f => (f._4), false).
val rdd_join6 = rdd_join5.map({ case (oid, mid, bid, wp, pv, clk) => (oid, mid, bid, pv, clk, wp) }); // oid,mid,bid,pv,clk,wp
// println("!!! ### after join = " + rdd_join6.count() + " , detail info: ");
println(" ********* statistic finished ********* "+df.format(now)+" ********* rdd_join6 = "+rdd_join6.count())
// write the increments into Redis
rdd_join6.foreachPartition(partitionOfRecords => {
val conn: Jedis = JedisConnectionPools.getConnection() // one Jedis connection per partition, not per record
partitionOfRecords.foreach(pair => { // (oid, mid, bid, pv, clk, wp)
val oid = pair._1
val bid = pair._3
val pv = pair._4
val clk = pair._5
val cost = pair._6
val HashKey = date_str + "::oid::" + oid
conn.hincrBy(HashKey, "cost", cost.toInt)
conn.hincrBy(HashKey, "pv", pv.toInt)
conn.hincrBy(HashKey, "bid", bid.toInt)
conn.hincrBy(HashKey, "clk", clk.toInt)
})
conn.close()
})
// println("### save to redis over ! time = "+df.format(now))
// once the results have been output correctly and idempotently:
// stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
// println("### offsets updated! time = "+df.format(now))
rdd.unpersist()
println(" ********* save to redis finished ***** "+df.format(now))
}
}
ssc.start()
ssc.awaitTermination()
}
}
The Redis connection pool implementation:
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}
object JedisConnectionPools {
val redisHost = "your redis host"
val redisPort = 6379
val redisTimeout = 30000
val conf = new JedisPoolConfig()
// maximum number of connections in the pool
conf.setMaxTotal(20)
// maximum number of idle connections
conf.setMaxIdle(10)
// whether to validate a connection when it is borrowed from the pool
//conf.setTestOnBorrow(true)
// host, port, connection timeout, password (reusing the values defined above)
val pool = new JedisPool(conf, redisHost, redisPort, redisTimeout, "password")
def getConnection():Jedis={
pool.getResource
}
def main(args: Array[String]): Unit = {
val conn = JedisConnectionPools.getConnection()
val r1 = conn.keys("oid::mid::*")
println(r1)
conn.close()
}
}
After building the jar with IntelliJ IDEA, run it on the cdh-master machine with:
nohup spark-submit --master yarn --queue root.dev --deploy-mode client --jars /home/libs/jedis-2.9.0.jar,/home/libs/commons-pool2-2.0.jar --class com.XXX.KafkaBidStatistic /home/streaming/dspstreaming.jar > /home/logs/test.log 2>&1 &
Notes:
1. This was my first Spark Streaming job. Much of the material online says offsets should be committed from your own code, but I did not commit manually, since the business requirements are not strict about occasional data loss. Manual committing means recording the offsets somewhere such as Redis and restoring them on every restart; a minimal sketch is given below.
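A minimal sketch of manual offset handling, based on the commented-out lines in the code above and the spark-streaming-kafka-0-10 API (it assumes enable.auto.commit is set to false in kafkaParams):
stream.foreachRDD { kafkaRDD =>
val offsetRanges = kafkaRDD.asInstanceOf[HasOffsetRanges].offsetRanges
// ... process kafkaRDD and write the results idempotently ...
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) // commit only after the output succeeded
}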
2. With direct-mode streaming, if the program dies, the data that was in flight at the time of the crash is lost on restart; the checkpoint mechanism is needed to recover it, which this article does not implement yet. A sketch follows below.
With the direct approach, Spark Streaming uses Kafka's simple API and tracks the consumed offsets itself, saving them in the checkpoint. Because Spark keeps this bookkeeping in step with batch processing, the data can be consumed once and only once.
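A minimal sketch of what checkpoint-based recovery would look like with StreamingContext.getOrCreate (the HDFS path is an invented placeholder, and the body of createContext would contain the stream construction shown above):
def createContext(): StreamingContext = {
val conf = new SparkConf().setMaster("yarn").setAppName("statisticBid2Redis")
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs:///user/streaming/checkpoint/bid") // assumed path
// ... build the Kafka direct stream and the processing pipeline here ...
ssc
}
val ssc = StreamingContext.getOrCreate("hdfs:///user/streaming/checkpoint/bid", createContext _)
ssc.start()
ssc.awaitTermination()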
3. Once the streaming program is running, its status can be watched in the CDH UI at an address like:
http://cdh-slaver4.lavapm:8088/proxy/application_1562297838757_0853/streaming/
scheduling delay: the time a batch spends waiting before it is processed.
If the scheduling delay keeps growing after the program has run for a while, the system cannot respond to the incoming data in real time: within each batch interval, the processing rate is lower than the rate at which data is produced. In that case the per-batch processing time has to be reduced, i.e., processing efficiency has to be improved. One option is to increase the number of partitions of the Kafka topic; topic partitions map one-to-one to the partitions of the streaming RDDs, so more partitions mean more parallel reads and faster processing. An example follows below.
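For example, the topic created earlier could be grown from 3 to 6 partitions (6 is an arbitrary choice here) with:
./kafka-topics.sh --zookeeper 192.168.1.65:2181/kafka --alter --topic statistic_bid_topic --partitions 6
Note that the partition count can only ever be increased, never decreased, and adding partitions changes how keyed messages map to partitions.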