Real-Time Project Statistics in Practice, Part 2


1. Overview

A simple code example is available at:

https://github.com/csy512889371/learndemo/tree/master/spark-streaming-kafka

2. Starting and Configuring Flume and Kafka

2.1 ZooKeeper

Kafka depends on ZooKeeper, so start ZooKeeper on all nodes first:

# start ZooKeeper on every node
for host in s201 s202 s203;
do
ssh $host "source /etc/profile;/soft/zk/bin/zkServer.sh start";
done
# check the ZooKeeper status on every node
for host in s201 s202 s203;
do
ssh $host "source /etc/profile;/soft/zk/bin/zkServer.sh status";
done

2.2 Starting Kafka

Start the Kafka broker (and stop it when needed):

# start the broker in the background
bin/kafka-server-start.sh config/server.properties &
# stop the broker
bin/kafka-server-stop.sh

Create a topic named "flumeTopic" with a single partition and a single replica:

bin/kafka-topics.sh --create --zookeeper s201:2181 --replication-factor 1 --partitions 1 --topic flumeTopic

Check the created topic with the list command:

bin/kafka-topics.sh --list --zookeeper s201:2181

Start a Kafka console consumer; once Flume is running (next step), every line appended to the tailed log file should show up here:

bin/kafka-console-consumer.sh --zookeeper s201:2181 --topic flumeTopic --from-beginning

2.3 Flume

Start Flume:

bin/flume-ng agent --conf conf --conf-file conf/a1.conf --name a1 -Dflume.root.logger=INFO,console

Flume configuration:

vi conf/a1.conf

# define the agent's sources, channels, and sinks
a1.sources = src1
a1.channels = ch1
a1.sinks = k1
# source: tail the application log file
a1.sources.src1.type = exec
a1.sources.src1.command = tail -F /home/centos/log/log
a1.sources.src1.channels = ch1
# sink: publish the events to the Kafka topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = flumeTopic
a1.sinks.k1.brokerList = s201:9092
a1.sinks.k1.batchSize = 20
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.channel = ch1
# channel: buffer the events in memory
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 1000

If no topic is configured, the sink falls back to a topic named "default-flume-topic". That default is defined in the flume-ng-kafka-sink project; to change the default name or other properties, modify the Constants class under flume-ng-kafka-sink/impl/src/main/java/com/polaris/flume/sink/:

// Constants.java
public static final String DEFAULT_TOPIC = "default-flume-topic";
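
The exec source above tails /home/centos/log/log, so something has to keep appending access-log-style lines to that file. The project has its own log generator; the sketch below is only a minimal, hypothetical Scala stand-in that writes lines in the tab-separated format the Spark job later parses (ip, time, request, referer, status code):

import java.io.FileWriter
import java.text.SimpleDateFormat
import java.util.Date
import scala.util.Random

// Hypothetical log generator: appends one fake access-log line per second
// to the file tailed by the Flume exec source.
object LogGenerator {
    def main(args: Array[String]): Unit = {
        val ips = Array("156.187.29.132", "124.30.187.10", "10.0.0.3")
        val categories = Array(1, 2, 4, 6)
        val referers = Array("-", "https://www.sogou.com/web?qu=test", "https://www.baidu.com/s?wd=test")
        val statuses = Array("200", "302", "404")
        val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        val rnd = new Random()

        while (true) {
            val request = "\"GET www/" + categories(rnd.nextInt(categories.length)) + " HTTP/1.0\""
            // fields: ip, timestamp, request, referer, status -- tab separated
            val line = Seq(
                ips(rnd.nextInt(ips.length)),
                fmt.format(new Date()),
                request,
                referers(rnd.nextInt(referers.length)),
                statuses(rnd.nextInt(statuses.length))
            ).mkString("\t")
            val writer = new FileWriter("/home/centos/log/log", true) // append mode
            writer.write(line + "\n")
            writer.close()
            Thread.sleep(1000)
        }
    }
}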

3. Setting Up the Development Environment


Add the Kafka and related dependencies to the Maven pom:

<properties>
    <scala.version>2.11.8</scala.version>
    <kafka.version>0.10.0.0</kafka.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.7.2</hadoop.version>
    <hbase.version>1.2.0</hbase.version>
  </properties>



  <dependencies>

    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <!-- Kafka dependency -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>0.10.0.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>1.2.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>1.2.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.1.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.1.0</version>
    </dependency>

  </dependencies>
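
Note that the Spark artifacts above are pinned to 2.1.0 even though the spark.version property says 2.2.0; in practice spark-streaming_2.11 and spark-streaming-kafka-0-10_2.11 should match the Spark version you actually run, and the ${spark.version}, ${kafka.version} and ${hbase.version} properties can be referenced instead of hard-coding the versions.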

4. Connecting Flume, Kafka, and Spark Streaming

The Spark application receives the data and counts the number of records.

The code below uses the older Kafka 0.8 receiver-based integration (KafkaUtils.createStream):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * Process the data coming from Kafka with Spark Streaming (Kafka 0.8 receiver API).
 */
object StatStreamingApp {
    def main(args: Array[String]): Unit = {
        if (args.length != 4) {
            println("Usage: StatStreamingApp <zkQuorum> <group> <topics> <numberThread>")
            System.exit(1)
        }
        val Array(zkQuorum, groupId, topics, numberThread) = args
        val sparkConf = new SparkConf().setMaster("local[4]").setAppName("StatStreamingApp")
        val ssc = new StreamingContext(sparkConf, Seconds(60))
        // map each topic to the number of receiver threads
        val topicMap = topics.split(",").map((_, numberThread.toInt)).toMap
        val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
        // smoke test: count the records received in each batch
        messages.map(_._2).count().print()
        ssc.start()
        ssc.awaitTermination()
    }
}
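
With the setup above, the application would be launched with arguments along the lines of "s201:2181 test-group flumeTopic 1" (the consumer group name and thread count here are only placeholders).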

The code below uses the Kafka 0.10+ direct-stream integration (spark-streaming-kafka-0-10), which is the version used in this project:


import com.study.spark.dao.{CategaryClickCountDAO, CategarySearchClickCountDAO}
import com.study.spark.domain.{CategaryClickCount, CategarySearchClickCount, ClickLog}
import com.study.spark.project.util.DataUtils
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

import scala.collection.mutable.ListBuffer


object StatStreamingApp {

    def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext("local[*]", "StatStreamingApp", Seconds(5))

        val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "s201:9092,s201:9092",
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "test",
            "auto.offset.reset" -> "latest",
            "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        val topics = Array("flumeTopic")
        val logs = KafkaUtils.createDirectStream[String, String](
            ssc,
            PreferConsistent,
            Subscribe[String, String](topics, kafkaParams)
        ).map(_.value())

       //156.187.29.132	2017-11-20 00:39:26	"GET /www/2 HTTP/1.0"	-	200

        // parse each raw line into a ClickLog and drop records without a valid category id
        val cleanLog = logs.map(line => {
            val infos = line.split("\t")
            val url = infos(2).split(" ")(1)
            var categaryId = 0
            if (url.startsWith("www")) {
                categaryId = url.split("/")(1).toInt
            }
            ClickLog(infos(0), DataUtils.parseToMin(infos(1)), categaryId, infos(3), infos(4).toInt)
        }).filter(log => log.categaryId != 0)

        cleanLog.print()
        // daily click count per category, keyed as (day_categaryId, 1)
        cleanLog.map(log=>{
            (log.time.substring(0,8)+log.categaryId,1)
        }).reduceByKey(_+_).foreachRDD(rdd=>{
            rdd.foreachPartition( partitions=>{
                val list = new ListBuffer[CategaryClickCount]
                partitions.foreach(pair=>{
                    list.append(CategaryClickCount(pair._1,pair._2))
                })
                CategaryClickCountDAO.save(list)
            })
        })
        // traffic per category coming from each search channel (referrer),
        // keyed as day_channel_categaryId, e.g. 20171122_www.baidu.com_1 -> 100
        // HBase table: create "categary_search_count","info"
        // sample input line:
        // 124.30.187.10	2017-11-20 00:39:26	"GET www/6 HTTP/1.0"	https:/www.sogou.com/web?qu=我的体育老师	302
        cleanLog.map(log=>{
           val url = log.refer.replace("//","/")
           val splits =   url.split("/")
            var host =""
            if(splits.length > 2){
               host=splits(1)
            }
            (host,log.time,log.categaryId)
        }).filter(x=>x._1 != "").map(x=>{
            (x._2.substring(0,8)+"_"+x._1+"_"+x._3,1)
        }).reduceByKey(_+_).foreachRDD(rdd=>{
            rdd.foreachPartition(partions=>{
                val list = new ListBuffer[CategarySearchClickCount]
                partions.foreach(pairs=>{
                    list.append(CategarySearchClickCount(pairs._1,pairs._2))
                })
               CategarySearchClickCountDAO.save(list)
            })
        })

        ssc.start()
        ssc.awaitTermination()
    }

}
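
The ClickLog, CategaryClickCount and CategarySearchClickCount domain classes, the two DAO objects and DataUtils.parseToMin all live in the project repository linked in the overview and are not shown in this post. Purely as orientation, case classes consistent with how they are used above might look like the following sketch (the HBase-backed DAO save methods are omitted):

// Hypothetical sketch of the domain classes, inferred from how they are used above;
// the real definitions are in the GitHub repository.
case class ClickLog(ip: String, time: String, categaryId: Int, refer: String, statusCode: Int)

// key: day_categaryId, value: click count for that day and category
case class CategaryClickCount(categaryId: String, clickCount: Int)

// key: day_channel_categaryId, value: click count coming from that search channel
case class CategarySearchClickCount(categarySearch: String, clickCount: Int)

// DataUtils.parseToMin presumably turns "yyyy-MM-dd HH:mm:ss" into a compact
// "yyyyMMddHHmm" string, so that log.time.substring(0, 8) yields the day (yyyyMMdd).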
