Configuring Flume
1. Name the agent's components
2. Describe and configure the source
3. Describe and configure the channel
4. Describe and configure the sink
5. Bind the sources and sinks to channels
An agent can have multiple sources, multiple channels and multiple sinks; here a1 is the agent name.
a1.sources=r1,r2
a1.sinks=s1,s2
a1.channels=c1,c2
#source-r1
a1.sources.r1.type=
a1.sources.r1.xxx=
#sink-s1
a1.sinks.s1.type=
a1.sinks.s1.xxx=
#channel-c1
a1.channels.c1.type=
a1.channels.c1.xxx=
#binding
#A source can write to multiple channels; a sink reads from exactly one channel
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Steps to use Flume
1. Write the configuration, e.g. conf/flume.properties
   (see the template above)
2. Run the agent
$ bin/flume-ng agent --conf conf \        # directory holding the configuration
    --conf-file example.conf \            # the configuration file
    --name a1 \                           # name of the agent to run
    -Dflume.root.logger=INFO,console
Core components
1. source:
Generates Events and calls the ChannelProcessor to put them into the channel(s).
Source → Event → ChannelProcessor → InterceptorChain(e) → ChannelSelector (picks the channel(s)) → Channels
start()
stop()
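The flow above can be sketched in Python (a toy model, not Flume's actual Java API; all names here are illustrative):

```python
# Toy model of: Source -> Event -> ChannelProcessor -> InterceptorChain
# -> ChannelSelector -> Channels.

class Event:
    def __init__(self, body, headers=None):
        self.body = body
        self.headers = headers or {}

class ChannelProcessor:
    def __init__(self, interceptors, selector):
        self.interceptors = interceptors  # callables: Event -> Event or None
        self.selector = selector          # callable: Event -> list of channels

    def process_event(self, event):
        # Run the interceptor chain; an interceptor may drop the event.
        for intercept in self.interceptors:
            event = intercept(event)
            if event is None:
                return
        # The channel selector decides which channel(s) receive the event.
        for channel in self.selector(event):
            channel.append(event)

# Usage: one interceptor stamps a header; the selector replicates to both channels.
c1, c2 = [], []
stamp = lambda e: (e.headers.update({"stamped": "yes"}) or e)
proc = ChannelProcessor([stamp], lambda e: [c1, c2])
proc.process_event(Event(b"hello"))
```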
2. channel:
Connects the Source (event producer) and the Sink (event consumer). A channel is essentially a buffer; it supports transactions, guaranteeing atomicity of put and take. A Channel must be thread-safe.
put()
take()
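The channel contract above can be sketched as a thread-safe bounded buffer with put and take (a toy model; Flume's real MemoryChannel additionally wraps these operations in transactions):

```python
import threading
from collections import deque

class MemoryChannel:
    """Thread-safe bounded buffer between a source (put) and a sink (take)."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.queue = deque()
        self.lock = threading.Lock()   # the real channel must be thread-safe

    def put(self, event):
        with self.lock:
            if len(self.queue) >= self.capacity:
                # A full channel rejects the put; the source's transaction rolls back.
                raise RuntimeError("channel full")
            self.queue.append(event)

    def take(self):
        with self.lock:
            return self.queue.popleft() if self.queue else None

ch = MemoryChannel(capacity=2)
ch.put("e1")
ch.put("e2")
```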
3. sink:
Connects to a channel, consumes its events, and sends them to a destination; many sink types are available.
Sinks can be grouped via SinkGroups and a SinkProcessor; a SinkRunner polls the processor to drive the sinks.
A Sink's process() may only be accessed by a single thread.
setChannel()
getChannel()
customSink (a custom sink)
1. Create a MySink class
2. Export a jar and put it on the Flume classpath
3. Configure a custom-s.conf file:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink (the fully qualified class name from our own jar)
a1.sinks.k1.type = com.rxcd.myflume.sink.MySink

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
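The sink contract (take from the channel, deliver, roll back on failure) can be sketched as follows; ToyChannel, sink_process and the batch size are all illustrative, not Flume's Java API:

```python
from collections import deque

class ToyChannel:
    """Stands in for a Flume channel: take() returns None when empty."""
    def __init__(self, events):
        self.q = deque(events)
    def take(self):
        return self.q.popleft() if self.q else None
    def putback(self, event):
        self.q.appendleft(event)

def sink_process(channel, deliver, batch_size=100):
    """One poll of the sink: take a batch inside a 'transaction', deliver it,
    and on failure return the events to the channel (rollback)."""
    batch = []
    try:
        for _ in range(batch_size):
            event = channel.take()
            if event is None:
                break
            batch.append(event)
        if not batch:
            return "BACKOFF"           # nothing available; the SinkRunner backs off
        for event in batch:
            deliver(event)             # e.g. write to console, file, or HDFS
        return "READY"                 # commit: events are consumed
    except Exception:
        for event in reversed(batch):  # rollback: no event is lost
            channel.putback(event)
        raise

out = []
status = sink_process(ToyChannel(["a", "b"]), out.append)
```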
JDBC Channel
This channel persists events in an embedded relational database (Derby). Unlike the memory channel, its contents are not lost when the agent crashes.
a1.channels.c1.type = jdbc
Flume does not ship a MySQL-backed channel, so using MySQL requires a custom channel class, configured along these lines:
a1.channels.c1.type = com.rxcd.myflume.channel.MyChannel
a1.channels.c1.db.type = MYSQL
a1.channels.c1.driver.class = com.mysql.jdbc.Driver
a1.channels.c1.driver.url = jdbc:mysql://192.168.1.106:3306/test
a1.channels.c1.db.user = root
a1.channels.c1.db.password = root
a1.channels.c1.db.schema = true
File Channel
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir =/mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
Spillable Memory Channel
a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000       # max events in the in-memory queue; 0 disables the memory queue
a1.channels.c1.overflowCapacity = 100000    # max events spilled to disk; once reached no more can be stored; 0 disables overflow
a1.channels.c1.byteCapacity = 80000         # capacity in bytes
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
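The spill behavior can be sketched as a two-tier buffer (a toy model; the real channel persists the overflow through the file-channel machinery):

```python
from collections import deque

class SpillableChannel:
    """Fill a bounded in-memory queue first, then overflow to a
    (simulated) disk store up to overflow_capacity."""
    def __init__(self, memory_capacity=3, overflow_capacity=5):
        self.mem = deque()
        self.disk = deque()   # stands in for the file-channel backing store
        self.memory_capacity = memory_capacity
        self.overflow_capacity = overflow_capacity

    def put(self, event):
        if len(self.mem) < self.memory_capacity:
            self.mem.append(event)
        elif len(self.disk) < self.overflow_capacity:
            self.disk.append(event)
        else:
            raise RuntimeError("channel full")

    def take(self):
        # Drain memory first, then the overflow.
        if self.mem:
            return self.mem.popleft()
        if self.disk:
            return self.disk.popleft()
        return None

ch = SpillableChannel(memory_capacity=2, overflow_capacity=2)
for e in ["e1", "e2", "e3", "e4"]:
    ch.put(e)
```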
Replicating Channel Selector (with two clusters, you can configure two channels and deliver events to both)
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3
Multiplexing Channel Selector
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
The multiplexing selector requires an event header: events whose "state" header is CZ go to c1, US goes to c2 and c3, and events without a matching header value go to the default channel c4.
3. Create a header file
header.txt
state=US
4. Start the flume agent
Flume channel selectors:
replicating
multiplexing
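The two selector behaviors can be sketched as pure routing functions (illustrative, not Flume internals):

```python
def replicating(event, channels):
    # Every configured channel receives a copy of the event.
    return list(channels)

def multiplexing(event, mapping, default, header="state"):
    # Route by the value of one header; unmatched values go to the default.
    return mapping.get(event["headers"].get(header), default)

# Mirrors the multiplexing config above: CZ -> c1, US -> c2 c3, default -> c4.
mapping = {"CZ": ["c1"], "US": ["c2", "c3"]}
us_route = multiplexing({"headers": {"state": "US"}}, mapping, ["c4"])
no_header = multiplexing({"headers": {}}, mapping, ["c4"])
```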
sink processor
Sinks can be grouped, which enables load balancing: sinks that do the same job go into one group, and the group balances load across its members.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
The default sink processor handles only a single sink, so for a single sink the user is not forced to create a group or processor.
Failover Sink Processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
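Failover selection can be sketched as "highest-priority live sink wins" (a simplification: the real processor also applies a back-off penalty, capped by maxpenalty, before retrying a failed sink):

```python
def pick_failover(sinks, priorities, failed):
    """Return the highest-priority sink not currently failed, or None."""
    live = [s for s in sinks if s not in failed]
    return max(live, key=lambda s: priorities[s]) if live else None

# Mirrors the config above: k1 has priority 5, k2 has priority 10.
priorities = {"k1": 5, "k2": 10}
first = pick_failover(["k1", "k2"], priorities, failed=set())        # -> "k2"
after_fail = pick_failover(["k1", "k2"], priorities, failed={"k2"})  # -> "k1"
```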
Load Balancing Sink Processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
TimestampInterceptor
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
HostInterceptor (records which host the event came from)
a1.sources = r1
a1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname
Source implementations
AvroSource: receives serialized events over Avro RPC
ExecSource: runs a script or executable command and turns its output into events
NetcatSource: listens on a TCP port, netcat-style (the "Swiss Army knife")
SpoolDirectorySource: watches a spooling directory for new files
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir=/home/ubuntu/a
SequenceGeneratorSource (a sequence-generator source)
a1.sources.r1.type = seq
Syslog TCP source (syslog events can also arrive over UDP with the corresponding source type)
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 8888
a1.sources.r1.host = localhost
Stress source (generates load to benchmark Flume's performance)
a1.sources.r1.type = org.apache.flume.source.StressSource
a1.sources.r1.size = 10240
a1.sources.r1.maxTotalEvents = 1000000
Custom source
Requires the fully qualified class name (package + class name), e.g.:
MySource extends AbstractSource
Sink
HDFS sink
a1.sinks = s1
a1.sinks.s1.type = hdfs
a1.sinks.s1.channel = c1
a1.sinks.s1.hdfs.path =/home/ubuntu/a/%y-%m-%d/%H%M/%s
Hive Sink
Writes are ultimately translated into MapReduce jobs, so YARN must be started.
Interceptors
Some interceptors set event headers; interceptors process events as a batch.
sink processor
load_balance: provides load balancing (the sinks take turns serving)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = random
sink groups
With multiple sinks in a group, several selection strategies are available:
1. round_robin
2. random
Sink groups can also provide a failover mechanism.
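The two load-balancing strategies can be sketched as follows (illustrative; the real processor also supports back-off for failed sinks):

```python
import itertools
import random

def round_robin(sinks):
    """Rotate through the sinks so each takes a turn."""
    return itertools.cycle(sinks)

def pick_random(sinks, rng=random):
    """Pick any sink uniformly at random."""
    return rng.choice(sinks)

rr = round_robin(["k1", "k2"])
order = [next(rr) for _ in range(4)]   # ["k1", "k2", "k1", "k2"]
```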
Flume interceptors
Flume can modify and drop events via interceptors.
Timestamp Interceptor
This interceptor adds to the event's headers (headers can then be used to route and dispatch events).
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
Host Interceptor
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname
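What the two interceptors put into the event headers can be sketched like this (illustrative Python; the real interceptors are Java classes configured as above):

```python
import socket
import time

def timestamp_interceptor(event):
    # Adds the processing time in milliseconds under the "timestamp" header.
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

def host_interceptor(event, host_header="hostname"):
    # Records which host the event passed through, under hostHeader.
    event["headers"][host_header] = socket.gethostname()
    return event

e = host_interceptor(timestamp_interceptor({"headers": {}, "body": b"log"}))
```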
Normally Flume collects data from the local machine.
If the data lives on a remote machine that exposes an access interface, Flume can still collect it: replace the local path with the provided interface.
To make Flume highly available, use the following layered architecture.
Configure Agent1, Agent2 and Agent3 on 192.168.50.100-102 respectively; the configuration is identical on all three machines:
flume-client.properties
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
#set group
agent1.sinkgroups = g1
#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/hadoop/flumetest/dir/logdfs/flumetest.log
agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp
# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = hadoopmaster
agent1.sinks.k1.port = 52020
# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = hadoopslave1
agent1.sinks.k2.port = 52020
#set sink group
agent1.sinkgroups.g1.sinks = k1 k2
#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
Configure Collector1 and Collector2 on 192.168.50.100-101 respectively. Their bound IP (or hostname) differs, so change it to each machine's own IP (or hostname).
flume-server.properties on 192.168.50.100 (hadoopmaster):
#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# other node, nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoopmaster
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = hadoopmaster
a1.sources.r1.channels = c1
#set sink to hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoopmaster:8020/flume/logdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = TEXT
a1.sinks.k1.hdfs.rollInterval = 1
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sinks.k1.hdfs.fileSuffix = .txt
flume-server.properties on 192.168.50.101 (hadoopslave1):
#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# other node, nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoopslave1
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = hadoopslave1
a1.sources.r1.channels = c1
#set sink to hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoopmaster:8020/flume/logdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = TEXT
a1.sinks.k1.hdfs.rollInterval = 1
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sinks.k1.hdfs.fileSuffix = .txt
We layer the Flume agents and connect the layers over Avro, which is what gives us high availability.