My Big Data Journey -- Flume + Kafka + Spark Streaming + HDFS

Latest recommended article published 2020-08-24 09:08:28

小牛头#

Category: Big Data

Article link: https://blog.csdn.net/qq_41562377/article/details/89603066

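A small end-to-end exercise

The overall idea: Flume watches a directory and forwards new files to a Kafka topic, Spark Streaming consumes that topic and counts words, and the results are written to HDFS.
[Figure: pipeline diagram; image not available]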

1. Build the dataset
Download an English essay from the web and save it as Chinese_Dream.txt.

vi Chinese_Dream.txt
-----------------------------------------------
Many years ago, when China was poor and lagged much behind the world, a lot of men went to California to seek for gold in the hope that they could be rich when they returned to hometown. American dream once influenced the world and it attracted people to realize their dreams. But today, the future is in China. Many young people come to China to find their dreams.
许多年前，中国很贫穷，远远落后于世界时，很多人都去加州寻找金子，他们希望可以带着财富回来。美国梦曾经影响了全世界，它吸引人们去实现他们的梦想，但是现在，未来在中国，许多年轻人都来中国寻找他们的梦想。

Chinese economy developed very fast in the last decades, the market is full of vitality and booming. More and more students choose to study abroad, but most of them decide to return China to start their business, which is very different with ten years ago. At that time, finding jobs in foreign countries was their target. What's more, China attracts foreigners to come and seek for cooperation, because they know clearly that China can provide the chance they need.
中国经济在过去的几十年里快速发展，市场充满了活力且蓬勃发展。越来越多的学生选择出国留学，但是他们中的大多数人都决定回国发展，这与10年前的情况很不一样的，那个时候，在国外找到工作就是他们的目标。而且，中国还吸引着外国人来寻求合作，因为他们很清楚，中国可以提供他们所需要的机会。

Mandarin is learned by people from all around the world. The newest report shows that more than 100 million people learn Chinese, which is the world's second language, and only ranks behind English. Chinese dream helps them to have more chances.
来自世界各地的人都在学习普通话。最新的报告显示，超过1亿人在学习中文，中文是世界第二语言，仅排在英语之后。中国梦帮助他们获得更多的机会。

2. Collect the data directory with Flume

vi FKChinese_Dream.conf
----------------------------------------------------------------
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Define the source
a1.sources.r1.type = spooldir
# Create this directory beforehand and make sure it is empty
a1.sources.r1.spoolDir = /user/Chinese_Dream

# Sink to Kafka
# Note: Kafka sink property names differ between Flume versions (Flume 1.7+
# uses kafka.topic, kafka.bootstrap.servers and kafka.producer.*), so check
# the user guide for the version you run.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# Kafka topic; create it beforehand
a1.sinks.k1.topic = dream
# Kafka broker address and port
a1.sinks.k1.brokerList = Master:9092
# Number of events sent per batch
a1.sinks.k1.flumeBatchSize = 20
a1.sinks.k1.producer.acks = 1
a1.sinks.k1.producer.linger.ms = 1
a1.sinks.k1.producer.compression.type = snappy

# Channel: buffer events in files on disk, which is more durable than memory
a1.channels.c1.type = file

# Create these directories beforehand
a1.channels.c1.checkpointDir = /user/flume-1.6/checkpoint
a1.channels.c1.dataDirs = /user/flume-1.6/data

# Connect source r1 and sink k1 through channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
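
A mistyped component name in a Flume properties file (for example `ki` instead of `k1`) is easy to miss. The following is a minimal sketch of a checker that lists component names that are configured but never declared. The function name `undeclared_components` is hypothetical and is not part of Flume.

```python
def undeclared_components(conf_text, agent="a1"):
    """Return component names that are configured but never declared."""
    declared = {"sources": set(), "sinks": set(), "channels": set()}
    used = {"sources": set(), "sinks": set(), "channels": set()}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        parts = key.split(".")
        if parts[0] != agent or len(parts) < 2 or parts[1] not in declared:
            continue
        kind = parts[1]
        if len(parts) == 2:
            # e.g. "a1.sinks = k1" declares the sink names
            declared[kind].update(value.split())
        else:
            # e.g. "a1.sinks.k1.type = ..." uses the name k1
            used[kind].add(parts[2])
    return {kind: sorted(used[kind] - declared[kind]) for kind in used}


sample = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.ki.producer.compression.type = snappy
"""
print(undeclared_components(sample))
# prints {'sources': [], 'sinks': ['ki'], 'channels': []}
```

Running it against the real file, e.g. `undeclared_components(open("FKChinese_Dream.conf").read())`, should return empty lists for all three kinds.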

3. Send the collected data to Kafka (environment variables already configured)

Create the topic dream:

kafka-topics.sh --create --zookeeper Master:2181 --replication-factor 1 --partitions 1 --topic dream

List the topics:

kafka-topics.sh --list --zookeeper Master:2181


4. Write the Spark Streaming program in Python 3

vi dream.py

```
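# -*- coding: UTF-8 -*-
# Note: pyspark.streaming.kafka is the Kafka 0.8 integration. It ships with
# Spark 2.x and was removed in Spark 3.0.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Run locally with two threads; the application name is FKChinese_Dream
sc = SparkContext("local[2]", "FKChinese_Dream")

# Batch interval of 2 seconds
ssc = StreamingContext(sc, 2)

# ZooKeeper quorum
zookeeper = "192.168.23.200:2181,192.168.23.201:2181,192.168.23.202:2181"

# Topics to consume, as {topic name: number of partitions to consume}.
# The topic "dream" was created with a single partition.
topic = {"dream": 1}

# Consumer group id; see group.id in /user/kafka/config/consumer.properties
groupids = "test-consumer-group"

'''
Signature: KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
This is the receiver-based approach, built on Kafka's high-level consumer API.
Data received by the receivers is stored in the Spark executors, and the jobs
launched by Spark Streaming then process it. With the default configuration
received data can be lost on failure; enabling the write-ahead log (WAL),
which is stored on HDFS, guards against that.
'''

lines = KafkaUtils.createStream(ssc, zookeeper, groupids, topic)

# Each record is a (key, value) pair; keep only the message value
lines_map = lines.map(lambda x: x[1])

# Split the lines received in each 2-second batch on spaces
words = lines_map.flatMap(lambda line: line.split(" "))

# Map each word to a (word, 1) tuple
pairs = words.map(lambda word: (word, 1))

# Sum the counts of identical keys within each batch
wordcounts = pairs.reduceByKey(lambda x, y: x + y)

# Write each batch to HDFS; directories are created automatically and are
# named <prefix>-<batch time in milliseconds>
wordcounts.saveAsTextFiles("/user/dream/dream_logs")
wordcounts.pprint()

# Start Spark Streaming
ssc.start()

# Wait for the computation to terminate
ssc.awaitTermination()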
```
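
To preview what one batch of the job produces without starting a cluster, the same split, map and reduce steps can be mimicked in plain Python. This is only a sketch of the logic, not Spark code, and `count_words` is a hypothetical helper name. The two sample sentences come from the essay above.

```python
from collections import Counter


def count_words(batch_lines):
    """Mimic flatMap(split) -> map((word, 1)) -> reduceByKey(add) for one batch."""
    counts = Counter()
    for line in batch_lines:
        # Same tokenisation as the streaming job: split on single spaces only
        for word in line.split(" "):
            counts[word] += 1
    return dict(counts)


batch = [
    "But today, the future is in China.",
    "Many young people come to China to find their dreams.",
]
result = count_words(batch)
print(result["China."], result["China"], result["to"])
# prints 1 1 2
```

Because the job splits only on spaces, punctuation stays attached to words, so `China.` and `China` are counted as different keys. Stripping punctuation before counting would merge them.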

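5. Start the cluster

Start ZooKeeper with a shell script:
```
sh zookeeper_start.sh
```
Start Kafka with a shell script:
```
sh kafka_start.sh
```
Start HDFS with a single command:
```
start-all.sh
```
Start Spark with a single command, run from the Spark installation directory:
```
sbin/start-all.sh
```
Run jps on each node to confirm that the daemons are running.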
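
Checking the jps output by eye across several nodes gets tedious. Below is a minimal sketch that reports which expected JVM processes are missing from a jps listing. The helper name `missing_daemons` and the list of expected processes are assumptions; adjust the list to what each of your nodes actually runs.

```python
def missing_daemons(jps_output, expected):
    """Return the expected JVM process names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # jps prints "<pid> <main class or jar name>"; skip lines without a name
        if len(parts) >= 2:
            running.add(parts[1])
    return [name for name in expected if name not in running]


# Assumed process names on the master node: ZooKeeper, Kafka, HDFS NameNode
# and the Spark standalone Master. Adjust to your own layout.
EXPECTED_ON_MASTER = ["QuorumPeerMain", "Kafka", "NameNode", "Master"]
```

One way to use it is `missing_daemons(subprocess.run(["jps"], capture_output=True, text=True).stdout, EXPECTED_ON_MASTER)`, which returns an empty list when everything expected is up.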

6. Start Flume

Start the Flume agent. (The original screenshot showed its output after the file had been dropped in.)

bin/flume-ng agent --conf /user/flume-1.6/conf --conf-file /user/flume-1.6/conf/FKChinese_Dream.conf --name a1 -Dflume.root.logger=INFO,console


Start a Kafka console consumer to see the messages that arrive after the file is dropped in:

kafka-console-consumer.sh --zookeeper Slave1:2181 --from-beginning --topic dream


Submit the Spark Streaming job dream.py:

spark-submit --jars /user/spark/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.3.jar /user/dream.py 2> error.txt


Copy the file into the directory Chinese_Dream that Flume monitors, on the Master host. (I did this over ssh from Slave2 to Master.)
```
cp Chinese_Dream.txt Chinese_Dream
```
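
The spooling directory source expects files to be complete and unchanged once they appear in the watched directory, and file names must not be reused. A plain `cp` is fine for a small file like this one. For larger files, a safer pattern is to copy into a staging directory first and then rename into place. The sketch below assumes the staging directory is on the same filesystem as the spool directory. `drop_into_spool` is a hypothetical helper, not a Flume API.

```python
import os
import shutil
import time


def drop_into_spool(src_path, spool_dir, staging_dir):
    """Copy src_path to staging_dir, then rename it into spool_dir.

    staging_dir is assumed to be on the same filesystem as spool_dir, so the
    final rename is atomic and Flume never sees a half-written file.
    """
    base = os.path.basename(src_path)
    # Add a millisecond timestamp so a file name is not reused in the spool dir
    unique_name = "{}.{}".format(base, int(time.time() * 1000))
    staged = os.path.join(staging_dir, unique_name)
    shutil.copyfile(src_path, staged)
    final = os.path.join(spool_dir, unique_name)
    os.rename(staged, final)
    return final
```

For this walkthrough the call would look like `drop_into_spool("Chinese_Dream.txt", "/user/Chinese_Dream", "/user/staging")`, where `/user/staging` is an assumed directory that you would create yourself.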
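View the output files on HDFS.
Each output directory name ends with the batch time as a timestamp. Convert the file's time to a Unix timestamp (an online converter works) and append three zeros to get milliseconds.

1) hadoop fs -ls /user/dream/

2) hadoop fs -ls /user/dream/dream_logs-1556503120000

3) hadoop fs -cat /user/dream/dream_logs-1556503120000/*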
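
Instead of an online converter, the directory name can be computed locally. saveAsTextFiles names each batch directory `<prefix>-<batch time in milliseconds>`. The sketch below rebuilds that name from a wall-clock time. The UTC+8 time zone is an assumption about the cluster clock, and `batch_dir` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def batch_dir(prefix, batch_time):
    """Return the directory name used for one batch: <prefix>-<epoch milliseconds>."""
    epoch_ms = int(batch_time.timestamp()) * 1000
    return "{}-{}".format(prefix, epoch_ms)


# Assumption: the cluster clock is UTC+8 (China Standard Time).
cst = timezone(timedelta(hours=8))
print(batch_dir("/user/dream/dream_logs", datetime(2019, 4, 29, 9, 58, 40, tzinfo=cst)))
# prints /user/dream/dream_logs-1556503120000
```

The result can be passed straight to `hadoop fs -ls` or `hadoop fs -cat`.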