Real-Time Project Statistics in Practice, Part 2
1. Overview
This walkthrough wires Flume, Kafka, and Spark Streaming into a real-time statistics pipeline: Flume tails an access log and ships each line to a Kafka topic, and a Spark Streaming job consumes the topic, cleans the records, and aggregates click counts per category and per referrer channel.
Sample code:
https://github.com/csy512889371/learndemo/tree/master/spark-streaming-kafka
2. Starting and Configuring Flume and Kafka
2.1 ZooKeeper
Kafka depends on ZooKeeper, so bring up the ZooKeeper ensemble first:
# Start ZooKeeper on every node
for host in s201 s202 s203;
do
ssh $host "source /etc/profile;/soft/zk/bin/zkServer.sh start";
done
# Check that every node came up
for host in s201 s202 s203;
do
ssh $host "source /etc/profile;/soft/zk/bin/zkServer.sh status"
done
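If the ensemble is healthy, the status command should report one node in leader mode and the rest as followers; wait for that before starting Kafka.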
2.2 Kafka
Start the broker (and stop it the same way when needed):
# start the broker in the background
bin/kafka-server-start.sh config/server.properties &
# stop the broker (the script needs no arguments)
bin/kafka-server-stop.sh
Create a topic named "flumeTopic" with one partition and one replica:
bin/kafka-topics.sh --create --zookeeper s201:2181 --replication-factor 1 --partitions 1 --topic flumeTopic
Confirm it exists with the list command:
bin/kafka-topics.sh --list --zookeeper s201:2181
Start a console consumer to watch messages arrive:
bin/kafka-console-consumer.sh --zookeeper s201:2181 --topic flumeTopic --from-beginning
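Before wiring Flume in, it can help to confirm that the topic accepts writes. Below is a minimal, illustrative Scala producer sketch (the ProducerCheck object and the test message are inventions for this check; the broker address matches the Flume config that follows):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "s201:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // send one throwaway record; the console consumer above should print it
    producer.send(new ProducerRecord[String, String]("flumeTopic", "hello-from-test"))
    producer.close()
  }
}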
2.3 Flume
Start Flume:
bin/flume-ng agent --conf conf --conf-file conf/a1.conf --name a1 -Dflume.root.logger=INFO,console
Flume configuration:
vi conf/a1.conf
# name the agent's components
a1.sources = src1
a1.channels = ch1
a1.sinks = k1
# source: tail the access log
a1.sources.src1.type = exec
a1.sources.src1.command = tail -F /home/centos/log/log
a1.sources.src1.channels = ch1
# sink: forward events to Kafka
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = flumeTopic
a1.sinks.k1.brokerList = s201:9092
a1.sinks.k1.batchSize = 20
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.channel = ch1
# channel: in-memory buffer
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 1000
The sink above sets the topic explicitly; when no topic is configured, events go to a default topic named "default-flume-topic". That default is defined in the flume-ng-kafka-sink project, so to change it (or other defaults), edit the Constants class at
flume-ng-kafka-sink/impl/src/main/java/com/polaris/flume/sink/Constants.java:
public static final String DEFAULT_TOPIC = "default-flume-topic";
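To exercise the pipeline, something has to append lines to the tailed file. A small sketch that writes one test record (the LogFeeder object is illustrative, and the tab-separated field layout, ip / time / request / referer / status, is inferred from how the Spark job below splits each line):

import java.io.{FileWriter, PrintWriter}

object LogFeeder {
  def main(args: Array[String]): Unit = {
    // open in append mode so tail -F picks the new line up immediately
    val out = new PrintWriter(new FileWriter("/home/centos/log/log", true))
    out.println(Seq(
      "156.187.29.132",                    // client ip
      "2017-11-20 00:39:26",               // timestamp
      "\"GET www/2 HTTP/1.0\"",            // request line; category id = 2
      "https:/www.sogou.com/web?qu=test",  // referer (channel)
      "200"                                // status code
    ).mkString("\t"))
    out.close()
  }
}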
3. Setting Up the Development Environment
Add the Scala, Kafka, Hadoop, HBase, and Spark dependencies to the Maven pom.xml:
<properties>
    <scala.version>2.11.8</scala.version>
    <kafka.version>0.10.0.0</kafka.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.7.2</hadoop.version>
    <hbase.version>1.2.0</hbase.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Kafka -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.11</artifactId>
        <version>${kafka.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <!-- Spark Streaming and the Kafka 0.10 integration; both must share one Spark version -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
4. Wiring Up Flume, Kafka, and Spark Streaming
The Spark application receives the data and counts the records it gets.
First, a version written against the Kafka 0.8 receiver-based API:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * Process data arriving from Kafka with Spark Streaming
 * (receiver-based API from spark-streaming-kafka-0-8).
 */
object StatStreamingApp {
  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      println("Usage: StatStreamingApp <zkQuorum> <group> <topics> <numberThread>")
      System.exit(1)
    }
    val Array(zkQuorum, groupId, topics, numberThread) = args
    val sparkConf = new SparkConf().setMaster("local[4]").setAppName("StatStreamingApp")
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    val topicMap = topics.split(",").map((_, numberThread.toInt)).toMap
    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    // smoke test: count the records received in each batch
    messages.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
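createStream above is the old receiver-based integration: a receiver pulls data and tracks offsets in ZooKeeper. The direct stream used next has no receiver, maps Kafka partitions straight to RDD partitions, and manages offsets through Kafka itself, which is why the project moved to it with 0.10.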
The Kafka 0.10+ version, which is what the project currently uses:
import com.study.spark.dao.{CategaryClickCountDAO, CategarySearchClickCountDAO}
import com.study.spark.domain.{CategaryClickCount, CategarySearchClickCount, ClickLog}
import com.study.spark.project.util.DataUtils
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import scala.collection.mutable.ListBuffer

object StatStreamingApp {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[*]", "StatStreamingApp", Seconds(5))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "s201:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "test",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("flumeTopic")
    val logs = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    ).map(_.value())

    // sample line: 156.187.29.132  2017-11-20 00:39:26  "GET /www/2 HTTP/1.0"  -  200
    val cleanLog = logs.map(line => {
      val infos = line.split("\t")
      // request line looks like "GET /www/2 HTTP/1.0"; strip a leading slash
      // so both "www/2" and "/www/2" parse the same way
      val url = infos(2).split(" ")(1).stripPrefix("/")
      var categaryId = 0
      if (url.startsWith("www")) {
        categaryId = url.split("/")(1).toInt
      }
      ClickLog(infos(0), DataUtils.parseToMin(infos(1)), categaryId, infos(3), infos(4).toInt)
    }).filter(log => log.categaryId != 0)
    cleanLog.print()

    // daily click count per category, keyed as (day + categaryId, 1)
    cleanLog.map(log => {
      (log.time.substring(0, 8) + log.categaryId, 1)
    }).reduceByKey(_ + _).foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        val list = new ListBuffer[CategaryClickCount]
        partition.foreach(pair => {
          list.append(CategaryClickCount(pair._1, pair._2))
        })
        CategaryClickCountDAO.save(list)
      })
    })

    // per-category traffic broken down by referrer channel,
    // e.g. key 20171122_www.baidu.com_1 (day_channel_categaryId) -> 100
    // HBase table: create "categary_search_count","info"
    // sample line: 124.30.187.10  2017-11-20 00:39:26  "GET www/6 HTTP/1.0"
    //              https:/www.sogou.com/web?qu=我的体育老师  302
    cleanLog.map(log => {
      val url = log.refer.replace("//", "/")
      val splits = url.split("/")
      var host = ""
      if (splits.length > 2) {
        host = splits(1)
      }
      (host, log.time, log.categaryId)
    }).filter(x => x._1 != "").map(x => {
      (x._2.substring(0, 8) + "_" + x._1 + "_" + x._3, 1)
    }).reduceByKey(_ + _).foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        val list = new ListBuffer[CategarySearchClickCount]
        partition.foreach(pair => {
          list.append(CategarySearchClickCount(pair._1, pair._2))
        })
        CategarySearchClickCountDAO.save(list)
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
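The domain classes, DataUtils, and the DAOs live in the project and aren't shown in this post. For orientation only, here is a hypothetical sketch inferred from how the job uses them; the field order, the column qualifier, and the quorum address are assumptions, not the repository's actual code:

import java.text.SimpleDateFormat
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes

// shapes inferred from the streaming job; the real com.study.spark.domain classes may differ
case class ClickLog(ip: String, time: String, categaryId: Int, refer: String, statusCode: Int)
case class CategaryClickCount(dayCategary: String, clickCount: Int)
case class CategarySearchClickCount(dayRefererCategary: String, clickCount: Int)

object DataUtils {
  // "2017-11-20 00:39:26" -> "201711200039" (yyyyMMddHHmm), so that
  // time.substring(0, 8) in the job yields the day, yyyyMMdd
  def parseToMin(time: String): String =
    new SimpleDateFormat("yyyyMMddHHmm")
      .format(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(time))
}

object CategarySearchClickCountDAO {
  // accumulate per-batch counts into the HBase table created above:
  // create "categary_search_count","info"
  def save(list: ListBuffer[CategarySearchClickCount]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "s201:2181") // assumed quorum address
    val conn = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(TableName.valueOf("categary_search_count"))
    try {
      list.foreach { c =>
        // an HBase counter adds each batch's delta to the running total
        table.incrementColumnValue(
          Bytes.toBytes(c.dayRefererCategary),
          Bytes.toBytes("info"),
          Bytes.toBytes("click_count"), // assumed column qualifier
          c.clickCount.toLong)
      }
    } finally {
      table.close()
      conn.close()
    }
  }
}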