Offline E-commerce Data Warehouse - User Behavior Collection Platform - Chapter 4: User Behavior Data Collection Module

书墨客

Last modified 2022-10-20 23:26:17

Category: Data Warehouse. Tags: kafka, hadoop, big data, zookeeper, flume

First published 2022-10-20 23:21:06

Article link: https://blog.csdn.net/qq_42619416/article/details/127437238

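Preface

This blog is a study log and may contain errors; treat it as a reference only.

If you spot a mistake, please point it out in the comments and I will fix it promptly.

I also hope you will discuss things with me in the comments or by private message; discussion helps us learn more efficiently.

This is not the final version; I will keep updating it as I learn.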

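Chapter 4: User Behavior Data Collection Module

4.2 Environment Preparation

4.2.2 Hadoop Installation

1) Configuring the cluster

1. core-site.xml configuration

Configure the atguigu user (the superuser) so that it is allowed to act as a proxy from every host, for every group, and for every user.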
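As a rough illustration of the sentence above, the proxy-user settings in core-site.xml usually take the following shape. This is a sketch built from Hadoop's standard hadoop.proxyuser.<user>.* properties with the atguigu user name from this post; it is not copied from the course document, and the wildcard values are an assumption based on "all hosts, all groups, all users".

```xml
<!-- Sketch only: let the atguigu superuser act as a proxy from any host -->
<property>
    <name>hadoop.proxyuser.atguigu.hosts</name>
    <value>*</value>
</property>
<!-- ...on behalf of users in any group -->
<property>
    <name>hadoop.proxyuser.atguigu.groups</name>
    <value>*</value>
</property>
<!-- ...and on behalf of any user -->
<property>
    <name>hadoop.proxyuser.atguigu.users</name>
    <value>*</value>
</property>
```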

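2. yarn-site.xml configuration

Do not distribute these three parameters as-is; set them separately on each machine according to its memory size.

2) Project experience

Multiple HDFS storage directories
Cluster data balancing
1. Balancing data between nodes
2. Balancing data between disks
Hadoop parameter tuning
1. HDFS parameter tuning
2. YARN parameter tuning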

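4.2.3 Zookeeper Installation

1) A problem you may run into

After I renamed the zookeeper directory, its name no longer matched the one in the course document, but I was still using the paths from the document. So after installing zookeeper, rename the directory exactly as the document does.

2) Zookeeper's election mechanism

See: "Zookeeper选举机制" (the Zookeeper election mechanism), from 流离岁月's blog on CSDN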

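4.2.4 Kafka Installation

Start zookeeper first, then start kafka.

Stop kafka first, then stop zookeeper.

Environment variables are usually configured on hadoop102 and then distributed to the other nodes. After configuring them, run source /etc/profile so that they take effect.

Topics

Producers

Consumers

I still need to study these three topics. #todo

4.2.5 Flume Installation

When Flume starts, it runs according to the configuration file it is given.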

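4.3 Log Collection with Flume

The Kafka sink acts as a producer: it writes data to a Kafka topic.

The Kafka source acts as a consumer: it reads data from a Kafka topic.

Three ways to use the Kafka channel

Reference: https://flume.apache.org/releases/content/1.10.1/FlumeUserGuide.html

Scheme 1: used together with a Flume source and sink

Notes:

The taildir source reads data from files and puts it into the Kafka channel.
The Kafka channel writes the data to a topic.
When the hdfs sink takes data from the Kafka channel, the Kafka channel first reads the data from the topic and then passes it to the hdfs sink.
Finally, the hdfs sink writes the data to HDFS.

Scheme 2: used together with a Flume source only

Notes: data is only read from files and written to Kafka.

Scheme 3: used together with a Flume sink only

Notes: data is only read from Kafka and written to HDFS.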
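To make scheme 3 more concrete, here is a minimal sketch of an agent that has a Kafka channel and an HDFS sink but no source. The component names a1, c1 and k1, the broker addresses and the topic topic_log follow the configuration used elsewhere in this post; the HDFS path is a made-up placeholder, and this is not the configuration the project actually uses.

```properties
# Sketch only: scheme 3 agent, Kafka channel + HDFS sink, no source
a1.channels=c1
a1.sinks=k1

# The channel reads from the topic that the upstream agent writes to
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = topic_log
# Should match how the data was written; false means the topic holds plain bodies
a1.channels.c1.parseAsFlumeEvent = false

# The HDFS sink drains the channel; the path is an assumed example
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /tmp/flume/topic_log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```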

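Which scheme fits depends on one Kafka channel parameter, parseAsFlumeEvent.

If parseAsFlumeEvent is set to true, data is passed to the Kafka channel as Flume events (header + body) and then written from the Kafka channel to the Kafka topic. The useful data is all in the body, so the header is stored as extra overhead. For the offline data warehouse, the body can be parsed out downstream, but the real-time data warehouse reads directly from the Kafka topic, where the header is useless.

If parseAsFlumeEvent is set to false, only the body is passed to the Kafka channel and there is no header. However, when the Kafka channel is used together with an interceptor, the header is needed.

This project combines scheme 2 and scheme 3.

Upstream, a Kafka channel (with parseAsFlumeEvent set to false) writes the data into Kafka,

and downstream an interceptor is used (#todo).

Using the Kafka channel removes one stage from the pipeline, which makes it more efficient.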

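4.3.2 Hands-on Flume Configuration for Log Collection

2) The configuration file contents are as follows

1. Configure sources

2. Configure channels

3. The final configuration file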

# 1. Define the components
a1.sources=r1
a1.channels=c1

# 2. Configure the source
a1.sources.r1.type=TAILDIR
a1.sources.r1.filegroups=f1
# Files to monitor
a1.sources.r1.filegroups.f1=/opt/module/applog/log/app.*
# Position file, so reading resumes from the last offset after a restart
a1.sources.r1.positionFile=/opt/module/flume/taildir_position.json

# 3. Configure the channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = topic_log
a1.channels.c1.parseAsFlumeEvent = false

# 4. Wire the source to the channel
a1.sources.r1.channels=c1

3) Writing the Flume interceptor

Interceptor usage, as described in the official Flume documentation

Flume has the capability to modify/drop events in-flight. This is done with the help of interceptors. Interceptors are classes that implement org.apache.flume.interceptor.Interceptor interface. An interceptor can modify or even drop events based on any criteria chosen by the developer of the interceptor. Flume supports chaining of interceptors. This is made possible through by specifying the list of interceptor builder class names in the configuration. Interceptors are specified as a whitespace separated list in the source configuration. The order in which the interceptors are specified is the order in which they are invoked. The list of events returned by one interceptor is passed to the next interceptor in the chain. Interceptors can modify or drop events. If an interceptor needs to drop events, it just does not return that event in the list that it returns. If it is to drop all events, then it simply returns an empty list. Interceptors are named components, here is an example of how they are created through configuration:

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.filePrefix = FlumeData.%{CollectorHost}.%Y-%m-%d
a1.sinks.k1.channel = c1

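4) My understanding:

1. Write the interceptor yourself in Java. The interceptor class must implement the org.apache.flume.interceptor.Interceptor interface and override its methods (initialize, the two intercept methods, and close). It also needs a static nested Builder class that implements Interceptor.Builder.

2. Package it with Maven into a jar (with dependencies).

3. Put the jar in /opt/module/flume/lib.

4. Then configure the interceptor in Flume. The configuration file goes in /opt/module/flume/job, with the following settings: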

a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=com.atguigu.gmall.flume.interceptor.ETLInterceptor$Builder

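Here com.atguigu.gmall.flume.interceptor.ETLInterceptor$Builder is the fully qualified name of the Builder class in the interceptor jar. Because Builder is a nested class, you must write it with the '$' character here, not with '.'.

5. Start Flume with the configuration file in /opt/module/flume/job.

6. Then start a Kafka consumer on hadoop103 and leave it running.

7. Then append an invalid JSON line to a log file under /opt/module/applog/log. If the Kafka consumer does not receive that invalid JSON line, the interceptor is working.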

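Other notes

The indices of an ArrayList shift whenever an element is removed, so calling remove inside an index-based loop can easily skip elements or throw an IndexOutOfBoundsException.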
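The following is a small, self-contained sketch of two common ways around the ArrayList problem: removing through an Iterator, and walking the indices backwards. The class EventListCleanup and the method looksLikeJson are hypothetical names made up for this example. Plain strings stand in for Flume events, and the brace check is only a placeholder for real JSON validation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Flume: safe ways to drop elements from an ArrayList.
class EventListCleanup {

    // Placeholder for a real JSON check: a line counts as valid if it is wrapped in braces.
    static boolean looksLikeJson(String line) {
        String t = line.trim();
        return t.startsWith("{") && t.endsWith("}");
    }

    // Option 1: remove through the iterator, which keeps its own cursor consistent.
    static List<String> dropInvalidWithIterator(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Iterator<String> it = copy.iterator();
        while (it.hasNext()) {
            if (!looksLikeJson(it.next())) {
                it.remove();
            }
        }
        return copy;
    }

    // Option 2: walk the indices backwards, so a removal never shifts an unvisited element.
    static List<String> dropInvalidBackwards(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        for (int i = copy.size() - 1; i >= 0; i--) {
            if (!looksLikeJson(copy.get(i))) {
                copy.remove(i); // i is an int, so this is remove(int index)
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("{\"a\":1}", "oops", "{broken", "{\"b\":2}");
        System.out.println(dropInvalidWithIterator(lines));
        System.out.println(dropInvalidBackwards(lines));
    }
}
```

Both methods work on a copy, so the caller's list is left untouched. In an interceptor that filters a list of events in place, the same iterator pattern would presumably apply.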
