I. Requirements
Simulate a stream-processing scenario: while I "speak", a Spark Streaming job I have written performs a live word count.
1. Simulate the speech: nc -lk 3399
flume
source: avro (qyl01:3399)
channel: memory
sink: kafka sink
Simulate real-time log generation:
echo aa bb cc >> /home/qyl/logs/flume.log
flume
source: exec (tail -F)
channel: memory
sink: kafka sink
2. Check that the whole chain flows end to end:
1. Say a sentence
2. Flume collects the data into Kafka
3. Spark Streaming computes a real-time word count and writes the results to HDFS for downstream processing.
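Before wiring up Flume, Kafka, and Spark, the word count at the end of the chain can be sanity-checked offline with a plain shell pipeline (a stand-in for the streaming job, not part of the pipeline itself):

```shell
# Offline model of the word count the streaming job should produce:
# split a sample line into words, then count occurrences per word.
echo "aa bb cc aa" | tr ' ' '\n' | sort | uniq -c | sort -rn
# "aa" occurs twice, so it tops the sorted counts
```

Whatever counts the Spark job later prints for a given input line should match this pipeline's output for the same line.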
II. Steps
1. Start the cluster
ZooKeeper, HDFS, YARN:
zkServer.sh start
start-dfs.sh
start-yarn.sh
2. Start Flume to collect the data
First build the Flume collection topology and decide on the collection method. Two agents are chained: agent1 tails the log file and forwards events over avro; agent2 receives them and writes to Kafka.
exec-avro.conf: collect new lines from the log file (agent1)
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1
#define sources
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/qyl/logs/flume.log
#define channels
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
#define sink
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = qyl03
agent1.sinks.k1.port = 44444
#bind sources and sink to channel
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
avro-kafka.conf: forward the received events to a Kafka topic (agent2)
agent2.sources = r2
agent2.channels = c2
agent2.sinks = k2
#define sources
agent2.sources.r2.type = avro
agent2.sources.r2.bind = qyl03
agent2.sources.r2.port = 44444
#define channels
agent2.channels.c2.type = memory
agent2.channels.c2.capacity = 1000
agent2.channels.c2.transactionCapacity = 100
#define sink
agent2.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
agent2.sinks.k2.brokerList = qyl01:9092,qyl02:9092,qyl03:9092
agent2.sinks.k2.topic = flume-kafka
agent2.sinks.k2.batchSize = 4
agent2.sinks.k2.requiredAcks = 1
#bind sources and sink to channel
agent2.sources.r2.channels = c2
agent2.sinks.k2.channel = c2
Start agent2 (first: its avro source must be listening before agent1's avro sink can connect)
flume-ng agent \
--conf /home/qyl/apps/apache-flume-1.8.0-bin/conf/ \
--name agent2 \
--conf-file /home/qyl/apps/apache-flume-1.8.0-bin/agentconf/avro-kafka.conf \
-Dflume.root.logger=DEBUG,console
Start agent1
flume-ng agent \
--conf /home/qyl/apps/apache-flume-1.8.0-bin/conf/ \
--name agent1 \
--conf-file /home/qyl/apps/apache-flume-1.8.0-bin/agentconf/exec-avro.conf \
-Dflume.root.logger=DEBUG,console
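With both agents running, the exec→avro→kafka chain can be smoke-tested by appending a line to the file agent1 tails (the echo from the requirements section). In this sketch LOG falls back to a temp file so it runs anywhere; on the cluster, set LOG=/home/qyl/logs/flume.log first:

```shell
# Append one test event to the tailed log; agent1's exec source should pick
# it up and agent2 should forward it to the flume-kafka topic.
LOG=${LOG:-$(mktemp)}   # real path on the cluster: /home/qyl/logs/flume.log
echo "aa bb cc" >> "$LOG"
tail -n 1 "$LOG"
```

If the event does not appear on agent2's DEBUG console, check the avro hop (qyl03:44444) before suspecting Kafka.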
3. Start Kafka, then produce and consume (common Kafka commands)
1. Start the Kafka broker (on each node):
bin/kafka-server-start.sh config/server.properties >>/dev/null 2>&1 &
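The `>>/dev/null 2>&1 &` tail of that command discards the broker's stdout, merges stderr into the same destination, and backgrounds the process. The idiom in isolation, with a throwaway command standing in for the broker:

```shell
# stdout appended to /dev/null, stderr redirected to stdout's destination,
# and the whole command run in the background.
( echo out; echo err >&2 ) >>/dev/null 2>&1 &
wait $!                 # wait for the background job to finish
echo "exit status: $?"  # prints: exit status: 0
```

Note the order matters: `2>&1 >>/dev/null` would leave stderr on the terminal.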
2. Create a topic:
bin/kafka-topics.sh --create --topic flume-kafka \
--partitions 3 \
--replication-factor 3 \
--zookeeper qyl01:2181,qyl02:2181,qyl03:2181
3. List all topics:
bin/kafka-topics.sh --list --zookeeper qyl01:2181,qyl02:2181,qyl03:2181
4. Describe the "flume-kafka" topic:
bin/kafka-topics.sh --describe --topic flume-kafka --zookeeper qyl01:2181,qyl02:2181,qyl03:2181
5. Start a Kafka console consumer:
bin/kafka-console-consumer.sh \
--bootstrap-server qyl01:9092,qyl02:9092,qyl03:9092 \
--from-beginning \
--topic flume-kafka
6. Start a Kafka console producer:
bin/kafka-console-producer.sh --topic flume-kafka --broker-list qyl01:9092
4. Write the Spark Streaming job that processes the data produced into Kafka in real time
The code:
import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreaming_kafka_2 {
  def main(args: Array[String]): Unit = {
    // Silence the noisy framework loggers
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.project-spark").setLevel(Level.WARN)
    val sparkConf = new SparkConf().setAppName("SparkStreaming_kafka_2").setMaster("local[2]")
    val streamingContext = new StreamingContext(sparkConf, Seconds(4))
    // updateStateByKey requires a checkpoint directory; this local path is for
    // testing only - use an HDFS path when running on the cluster
    streamingContext.checkpoint("c:/ssk1807")
    // Receive data from Kafka using the direct (receiver-less) approach
    val kafkaParams = Map("metadata.broker.list" -> "qyl01:9092,qyl02:9092,qyl03:9092")
    val kafkaStreams: DStream[String] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext,
      kafkaParams,
      Set("flume-kafka")
    ).map(_._2) // keep only the message value, drop the key
    // Running word count: per-word state accumulates across batches
    val resultDStream = kafkaStreams.flatMap(_.split(" ")).map((_, 1)).updateStateByKey(
      (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
    )
    resultDStream.print()
    // To write the results to HDFS as the requirements state, replace print()
    // with something like resultDStream.saveAsTextFiles(<hdfs output path>)
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
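The cumulative behavior of updateStateByKey - counts carry over from batch to batch instead of resetting every interval - can be modeled in the shell: two sample "batches" of words are fed through a single awk process that keeps running totals, the way the checkpointed state does:

```shell
# Two batches ("aa bb", then "aa cc") through one running count,
# mirroring updateStateByKey's accumulated per-key state.
printf 'aa bb\naa cc\n' | tr ' ' '\n' | awk 'NF {c[$1]++} END {for (w in c) print w, c[w]}' | sort
# "aa" appears in both batches, so its final count is 2
```

In the real job each print() shows these ever-growing totals, not just the counts of the latest 4-second batch.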