Flume + Kafka + Spark Streaming + Redis: real-time statistics of ad-serving pv, uv, click and cost

Because of a change in business logic, the serving data is now written into the big-data cluster, so the previous pipeline had to be reworked so that RTB serving spend can be tracked in real time.

Environment versions:

spark: 2.11-2.4.0-cdh6.2.0

kafka: 2.1.0-cdh6.2.0

flume: 1.9.0-cdh6.2.0

  • 1. Flume configuration
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# sources
# spooldir source: pick up newly added log files from a watched directory
a1.sources.r1.type = spooldir
# Flume 1.9 can recursively watch sub-directories for new files
a1.sources.r1.recursiveDirectorySearch = true
# the directory Flume watches
a1.sources.r1.spoolDir = /mnt/data1/flume_data/bid_bsw
a1.sources.r1.fileHeader = false
a1.sources.r1.ignorePattern = ^(.)*\\.[0|1]{1}$
a1.sources.r1.includePattern = ^bid.log.[0-9]{12}$

# channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# sinks
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# target Kafka topic
a1.sinks.k1.kafka.topic = statistic_bid_topic
a1.sinks.k1.kafka.bootstrap.servers = 192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.flumeBatchSize = 20

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

Notes:

a1.sources.r1.spoolDir sets the directory that receives the bid logs to be monitored;

a1.sources.r1.recursiveDirectorySearch makes the source recursively watch that directory for newly created files;

a1.sinks.k1.kafka.bootstrap.servers is the Kafka Destination Broker List.
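On a CDH cluster the agent configuration is usually managed from Cloudera Manager, but if the agent is started by hand, a minimal sketch (assuming the properties above are saved as bid_kafka.conf, a hypothetical file name) looks like this:

flume-ng agent \
  --conf /etc/flume-ng/conf \
  --conf-file /etc/flume-ng/conf/bid_kafka.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console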

  • 2. Kafka installation and configuration

On the CDH cluster go to Actions -> Add Service -> Kafka and add the service.

cdh-master, cdh-slaver1 and cdh-slaver2 are installed as Kafka brokers;

cdh-master also acts as the Kafka MirrorMaker.

Kafka configuration:

zookeeper.chroot: /kafka

auto.create.topics.enable: checked

delete.topic.enable: checked

broker.id

log.dirs: /XXX/kafka/data

Note: if the disk holding this path runs out of space, Kafka reports an error and shuts down!

bootstrap.servers: 192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092 (the servers on which the Kafka brokers are installed)

source.bootstrap.servers: 192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092

whitelist: 192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092

oom_heap_dump_dir: /tmp

Creating the topic:

From Kafka's bin directory, run:

./kafka-topics.sh --create --topic statistic_bid_topic --zookeeper cdh-master.lavapm:2181,cdh-slaver1.lavapm:2181,cdh-slaver2.lavapm:2181/kafka --partitions 3 --replication-factor 2

This creates the topic statistic_bid_topic with 3 partitions and a replication factor of 2.

List all topics:

./kafka-topics.sh --zookeeper 192.168.1.65:2181/kafka --list

Describe a topic:

The following lists the number of partitions of statistic_bid_topic, its replication factor, and the leader/replica assignment of each partition:

./kafka-topics.sh --zookeeper 192.168.1.65:2181/kafka --topic statistic_bid_topic --describe

List consumer groups:

Use the --list flag:

./kafka-consumer-groups.sh --bootstrap-server 192.168.1.65:9092 --list

Describe a specific consumer group:

Use the --group and --describe flags:

./kafka-consumer-groups.sh --bootstrap-server 192.168.1.65:9092 --group statistic_bid_topic_Group --describe

Consume messages, reading the whole topic from the beginning:

./kafka-console-consumer.sh --bootstrap-server 192.168.1.65:9092 --topic statistic_bid_topic --from-beginning

Produce messages from the console:

./kafka-console-producer.sh --broker-list cdh-master.lavapm:9092 --topic statistic_bid_topic
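Before starting the streaming job, the Flume -> Kafka leg can be smoke-tested by dropping a small file into the spooled directory and watching it arrive with the console consumer. This is only a sketch: the file name has to match includePattern above, and the sample JSON line is made up, carrying just the fields the parser needs:

echo '{"ev":"bid","oid":"284","mid":"687","id":"test-id"}' > /mnt/data1/flume_data/bid_bsw/bid.log.201907080000

./kafka-console-consumer.sh --bootstrap-server 192.168.1.65:9092 --topic statistic_bid_topic --from-beginning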

  • 3. Spark Streaming development

The job reads the bid data from the Kafka topic in real time, parses each line as a JSON string, extracts the corresponding imp, bid and clk records, aggregates and merges them, and finally writes the increments into Redis. Because these are daily serving statistics, the stored key format is <year-month-day>::oid::<ad-group oid>.

Each such key holds the detailed counters of a single ad group.
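For illustration, querying one such key with redis-cli might return something like this (the oid and all the numbers are made-up values):

127.0.0.1:6379> HGETALL 20190708::oid::284
1) "cost"
2) "1894"
3) "pv"
4) "3210"
5) "bid"
6) "152340"
7) "clk"
8) "27"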

The implementation in Scala:

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.HashPartitioner
import scala.util.parsing.json._
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object KafkaBidStatistic extends Serializable {

  def regJson(json: Option[Any]): collection.immutable.Map[String, Any] = json match {
    // cast the parsed value to a Map; anything that is not a Map yields an empty result
    case Some(map: collection.immutable.Map[String, Any]) => map
    case _ => collection.immutable.Map()
  }

  /**
    * Parse a JSON string into a key/value map
    *
    * @param string_json the raw JSON line
    * @return the parsed map, or an empty map if parsing failed
    */
  def str_json(string_json: String): collection.immutable.Map[String, Any] = {
    var first: collection.immutable.Map[String, Any] = collection.immutable.Map()
    val jsonS = JSON.parseFull(string_json)
    // the input is not guaranteed to be valid JSON, so only convert when parsing succeeded
    if (jsonS.isDefined) {
      first = regJson(jsonS);
    }
    first
  }

  def get_str_json(line: String): String = {
    var ret = "";
    try{
      var first: collection.immutable.Map[String, Any] = str_json(line);
      // use the shape of an imp log line to pick out the fields we aggregate on
      // sample imp log line: {"cip":"223.24.172.199","mid":"687","pid":"","oid":"284","et":{"bid":1557196078,"imp":1557196080},"wd":{},"ip":"35.200.110.13","ov":1,"ex":20191211,"url":"","usc":["IAB18"],"uri":"\/imp?ext_data=b2lkPTI4NCxjaWQ9LHV1aWQ9VURhSlVmQXEybW1acmVpSTZzWVJCZSxtZT1iaWRzd2l0Y2gsZXg9Ymlkc3dpdGNoLHBjPTAsbGlkPSxuYXQ9MCx2cj0wLHV0PTE1NTcxOTYwNzgsZG9tPSxwaWQ9LG9jYz1jaW0sY2NwPTU2MDAwMCxhaWQ9Njg3LHk9MCxhdj0w&ver=1&reqid=bidswitch&price=0.55992","occ":"cim","me":"bidswitch","mobile":{"id":"","os":2},"ua":"BidSwitch\/1.0","id":"UDaJUfAq2mmZreiI6sYRBe","op":"","ccp":1,"ins":{},"sz":"320x50","reuse":1,"wp":0.55992,"is_mobile":1,"nwp":1,"ev":"imp","rg":1840000000,"did":"CDz2mRv7s7fffMuVY3eeuy"}
      //  && first.contains("uri") -- domestic (CN) traffic does not carry this field!
      if (first("ev").toString == "imp" && first.contains("me")) { // first.getOrElse("me" , 1)   && first("me").toString=="bidswitch"
        val wp = first("wp");
        val mid = first("mid").toString;
        val oid = first("oid").toString;
        val ev = first("ev").toString;
        val id = first("id").toString;
        ret = oid + "," + mid + "," + wp + ",\"" + ev + "\",\"" + id + "\""; //  oid,mid,wp,"imp","id"
      }
      else if (first("ev").toString == "bid") {
        val mid = first("mid").toString;
        val oid = first("oid").toString;
        val id = first("id").toString;
        val ev = first("ev").toString;
        ret = oid + "," + mid + ",\"" + ev + "\",\"" + id + "\""; //  oid,mid,"bid","id"
      }
      else if (first("ev").toString == "clk") {
        val mid = first("mid").toString;
        val oid = first("oid").toString;
        val id = first("id").toString;
        val ev = first("ev").toString;
        ret = oid + "," + mid + ",\"" + ev + "\",\"" + id + "\""; //  oid,mid,"clk","id"
      }
    }
    catch{
      case ex: Exception => {
        ex.printStackTrace() // print the stack trace to stderr
        System.err.println("get_str_json() failed to parse line = " + line) // also log the offending line to stderr
      }
    }
    ret;
  }


  def main(args: Array[String]): Unit = {
  //  Logger.getLogger("org.apache.spark").setLevel(Level.WARN);

    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.kafka.clients.consumer").setLevel(Level.ERROR)

    val conf = new SparkConf();
    conf.setMaster("yarn").setAppName("statisticBid2Redis")

    // Streaming settings must be placed on the SparkConf before the StreamingContext is created;
    // setting them afterwards through SparkSession.conf has no effect on the streaming engine.
    // enable backpressure
    conf.set("spark.streaming.backpressure.enabled", "true")
    // on shutdown, finish the in-flight batch before stopping, so a kill does not drop half-processed data
    conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    // initial ingestion rate used by backpressure; only effective in receiver mode, not in direct mode
    conf.set("spark.streaming.backpressure.initialRate", "5000")
    // maximum records read per Kafka partition per second; keeps the first batch (and every batch) from being overloaded
    conf.set("spark.streaming.kafka.maxRatePerPartition", "3000")
    // increase the number of concurrent jobs
    //  conf.set("spark.streaming.concurrentJobs", "10")
    // raise the retry count used when fetching topic partition leaders and their latest offsets (receiver mode only)
    //  conf.set("spark.streaming.kafka.maxRetries", "50")

    // check for new data every 2 seconds
    val ssc = new StreamingContext(conf, Seconds.apply(2))
    val ssce = SparkSession.builder().config(conf).getOrCreate

    val sc = ssce.sparkContext;
    println("********* spark.default.parallelism *********"+sc.defaultParallelism)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "192.168.1.65:9092,192.168.1.66:9092,192.168.1.67:9092", // Kafka cluster
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "statistic_bid_topic_Group",
      "auto.offset.reset" -> "latest", // earliest = always start from the beginning (from-beginning); latest = consume only new messages; smallest is the old consumer's name for earliest
      "enable.auto.commit" -> (true: java.lang.Boolean) // true = auto-commit offsets; false = commit offsets manually
    )

    val topics = Array("statistic_bid_topic") // subscribed topics; more than one can be listed

    // Direct approach
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.foreachRDD { kafkaRDD =>
      if (!kafkaRDD.isEmpty()) {
        val now: Date = new Date();
        val dateFormat: SimpleDateFormat = new SimpleDateFormat("yyyyMMdd");
        val date_str = dateFormat.format(now);
        val df: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");//yyyyMMddHHmmss

      //  val offsetRanges = kafkaRDD.asInstanceOf[HasOffsetRanges].offsetRanges

        val lines: RDD[String] = kafkaRDD.map(e => (e.value()))

        // parse the JSON lines and keep only the fields needed for aggregation
        val rdd = lines.flatMap {
          line =>
            val parse_str = get_str_json(line);
            val arr = parse_str.split("\t");
            arr;
        }.filter(word => word.nonEmpty).distinct(); // drop empty results and duplicates

        rdd.cache();

        // imp      oid,mid,wp,"imp","id"
        val rdd_imp = rdd.flatMap(_.split("\t")).filter(_.indexOf("\"imp\"") > 0).map(_.split(","));
        val rdd_imp1 = rdd_imp.map(word => (word(0), word(1)) -> (word(2).toDouble / 1000).formatted("%.6f").toDouble); // (oid,mid) -> wp; the winning price is a CPM, so divide by 1000

        // aggregate cost
        //val result_imp_cost = rdd_imp1.reduceByKey(_ + _).sortBy(_._2, false).map({ case ((oid, mid), wp) => (oid, mid) -> (wp.toDouble.formatted("%.6f").toDouble*1000000).toInt }); // BidSwitch prices are real prices, not multiplied by 1000
        val result_imp_cost = rdd_imp1.reduceByKey(_ + _).sortBy(_._2, false).map({ case ((oid, mid), wp) => (oid, mid) -> (wp.toDouble.formatted("%.6f").toDouble*1000).toInt });

        // aggregate pv
        val rdd_imp3 = rdd_imp1.map({ case ((oid, mid), wp) => (oid, mid) -> 1 }); // (oid,mid) -> 1
        val result_imp_count = rdd_imp3.reduceByKey(_ + _).sortBy(_._2, false); // count impressions (pv)

        // aggregate bid count   oid,mid,"bid","id"
        val rdd_bid = rdd.flatMap(_.split("\t")).distinct.filter(_.indexOf("\"bid\"") > 0).map(_.split(","));
        val rdd_bid1 = rdd_bid.map(word => (word(0), word(1)) -> 1);
        val result_bid_count = rdd_bid1.reduceByKey(_ + _).sortBy(_._2, false);

        // aggregate click count  oid,mid,"clk","id"
        val rdd_clk = rdd.flatMap(_.split("\t")).distinct.filter(_.indexOf("\"clk\"") > 0).map(_.split(","));
        val rdd_clk1 = rdd_clk.map(word => (word(0), word(1)) -> 1);
        val result_clk_count = rdd_clk1.reduceByKey(_ + _).sortBy(_._2, false);

        // join imp+bid+clk
        val rdd_join1 = result_imp_cost.join(result_imp_count).map({ case ((oid, mid), (wp, pv)) => (oid, mid) -> (wp, pv) }); // ((oid,mid),(wp,pv))
        val rdd_join2 = result_bid_count.leftOuterJoin(rdd_join1);
        // (oid,mid),(bid,(wp,pv))  (wp,pv)Option[(Double,Int)]
        val rdd_join3 = rdd_join2.map { case ((oid, mid), (bid, x)) =>
          val (wp, pv) = x.getOrElse((0, 0));
          (oid, mid) -> (bid, wp, pv)
        }
        val rdd_join4 = rdd_join3.fullOuterJoin(result_clk_count);
        // (oid,mid),((bid,wp,pv),clk)
        val rdd_join5 = rdd_join4.map { case ((oid, mid), (x, y)) =>
          val (bid, wp, pv) = x.getOrElse((0, 0, 0));
          (oid, mid, bid, wp, pv, y.getOrElse(0))
        }
        //"{"+oid+","+mid+","+bid+","+pv+","+clk+","+wp+"}"   oid,mid,bid,wp,pv,clk   .sortBy(f => (f._4), false).
        val rdd_join6 = rdd_join5.map({ case (oid, mid, bid, wp, pv, clk) => (oid, mid, bid, pv, clk, wp) }); // oid,mid,bid,pv,clk,wp
        // println("!!! ### after join = " + rdd_join6.count() + " , detail info: ");

        println(" ********* statistic finished ********* "+df.format(now)+" ********* rdd_join6 = "+rdd_join6.count())

        // write the increments into Redis
        rdd_join6.foreachPartition(partitionOfRecords => {
          val conn: Jedis = JedisConnectionPools.getConnection() // one Jedis connection per partition rather than per record
          partitionOfRecords.foreach(pair => { // oid, mid, bid, pv, clk, wp
            val oid = pair._1
            val bid = pair._3
            val pv = pair._4
            val clk = pair._5
            val cost = pair._6
            val HashKey = date_str + "::oid::" + oid
            conn.hincrBy(HashKey, "cost", cost.toInt)
            conn.hincrBy(HashKey, "pv", pv.toInt)
            conn.hincrBy(HashKey, "bid", bid.toInt)
            conn.hincrBy(HashKey, "clk", clk.toInt)
          })
          conn.close()
        })

       // println("### saved to Redis! time = " + df.format(now))
       // make sure all results have been written correctly and idempotently before committing
       // stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
       // println("### offsets committed! time = " + df.format(now))

        rdd.unpersist()
        println(" ********* save to redis finished ***** "+df.format(now))

      }
    }
    ssc.start()
    ssc.awaitTermination()
  }

}

Redis connection pool implementation:

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object JedisConnectionPools {
  val redisHost = "your redis host"
  val redisPort = 6379
  val redisTimeout = 30000

  val conf = new JedisPoolConfig()

  // maximum number of connections
  conf.setMaxTotal(20)

  // maximum number of idle connections
  conf.setMaxIdle(10)

  // whether to validate a connection when it is borrowed from the pool
  //conf.setTestOnBorrow(true)

  // host, port, connection timeout (ms) and password
  val pool = new JedisPool(conf, redisHost, redisPort, 10000, "password")

  def getConnection():Jedis={
    pool.getResource
  }

  def main(args: Array[String]): Unit = {
    val  conn = JedisConnectionPools.getConnection()   
    val r1 = conn.keys("oid::mid::*") 
    println(r1)
    conn.close()   
  }

}

 

After packaging the jar with IntelliJ IDEA, the job is run on the cdh-master machine with the following command:

nohup spark-submit --master yarn --queue root.dev --deploy-mode client --jars /home/libs/jedis-2.9.0.jar,/home/libs/commons-pool2-2.0.jar --class com.XXX.KafkaBidStatistic /home/streaming/dspstreaming.jar > /home/logs/test.log 2>&1 &
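For reference, a minimal sbt dependency sketch matching the versions used above (the coordinates are assumptions; align them with the cluster's Spark and Scala build):

// build.sbt (sketch)
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                 % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-streaming"            % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"                  % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.0",
  "redis.clients"    %  "jedis"                      % "2.9.0"
)

spark-streaming-kafka-0-10 and jedis are not part of the Spark distribution, which is why the spark-submit command above ships jedis and commons-pool2 via --jars; the Kafka integration can either be shaded into the application jar or pulled in with --packages.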

Notes:

1. This was my first Spark Streaming project. Much of the material online says you should commit the Kafka offsets yourself from code, but I did not commit them manually, because the business requirements are not strict about occasional data loss. Manual committing means recording the offsets somewhere such as Redis after every batch and restoring them on each restart; a simpler variant, committing the offsets back to Kafka itself, is sketched below.
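The sketch below uses the spark-streaming-kafka-0-10 API; it is not what the job above does, and it assumes enable.auto.commit has been set to false:

stream.foreachRDD { kafkaRDD =>
  // capture the offset ranges before the RDD is transformed or repartitioned
  val offsetRanges = kafkaRDD.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch and write the results idempotently ...

  // commit back to Kafka only after the output has succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}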

2. With the Direct approach, if the program crashes and is restarted, the data that was in flight at the time of the crash is lost; the checkpoint mechanism is needed to avoid this, but it is not implemented in this post yet.

With the Direct approach, Spark Streaming uses Kafka's simple API and tracks the consumed offsets itself, saving them in the checkpoint. Since Spark keeps this bookkeeping in sync with the processing, records are consumed once and only once on the receiving side (true end-to-end exactly-once additionally requires that the output be written idempotently).
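A minimal sketch of how checkpoint-based recovery could be wired in (not implemented in this post; the HDFS path is an assumption):

def createContext(): StreamingContext = {
  val conf = new SparkConf().setMaster("yarn").setAppName("statisticBid2Redis")
  val ssc = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint("hdfs:///user/streaming/checkpoint/statisticBid") // assumed checkpoint directory
  // ... build the direct stream and register all foreachRDD logic here ...
  ssc
}

// on restart the driver rebuilds the context, including the tracked Kafka offsets, from the checkpoint
val ssc = StreamingContext.getOrCreate("hdfs:///user/streaming/checkpoint/statisticBid", createContext _)
ssc.start()
ssc.awaitTermination()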

3. Once the streaming job is running, its progress can be monitored from the CDH/YARN UI, at an address like:

http://cdh-slaver4.lavapm:8088/proxy/application_1562297838757_0853/streaming/

scheduling delay: the time a batch spends waiting before it starts being processed;

If the scheduling delay keeps growing after the job has been running for a while, the system cannot respond to the incoming data in real time: within each batch interval the processing rate is lower than the rate at which data is produced. In that case the per-batch processing time has to come down, i.e. processing efficiency has to go up. One way is to increase the number of partitions of the Kafka topic; with the Direct approach they map one-to-one to the partitions of the streaming RDD, so more Kafka partitions means the data is read and processed with more parallelism, as in the example below.
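For example, raising the topic from 3 to 6 partitions (with the ZooKeeper-based tooling the partition count can only be increased, never decreased):

./kafka-topics.sh --zookeeper 192.168.1.65:2181/kafka --alter --topic statistic_bid_topic --partitions 6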
