Configuring Flume
1. Name the agent's components
2. Describe and configure the source
3. Describe and configure the channel
4. Describe and configure the sink
5. Bind the sources and sinks to channels
An agent can have multiple sources, multiple channels and multiple sinks; here a1 is the agent name.
a1.sources=r1,r2
a1.sinks=s1,s2
a1.channels=c1,c2
#source-r1
a1.sources.r1.type=
a1.sources.r1.xxx=
#sink-s1
a1.sinks.s1.type=
a1.sinks.s1.xxx=
#channel-c1
a1.channels.c1.type=
a1.channels.c1.xxx=
#binding
#A source can write to multiple channels; a sink reads from exactly one channel
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Steps to use Flume
1. Write the configuration, e.g. conf/flume.properties
   (see the template above)
2. Run the agent
$ bin/flume-ng agent --conf conf \        # directory holding the configuration
    --conf-file example.conf \            # the configuration file
    --name a1 \                           # name of the agent to run
    -Dflume.root.logger=INFO,console
Core components
1. source:
Generates Events and calls the ChannelProcessor to put them into the channel(s).
Source → Event → ChannelProcessor → InterceptorChain(e) → ChannelSelector (picks the channel(s)) → Channels
start()
stop()
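The flow above can be sketched in Python (a toy model, not Flume's actual Java API; all names here are illustrative):

```python
# Toy model of: Source -> Event -> ChannelProcessor -> InterceptorChain
# -> ChannelSelector -> Channels.

class Event:
    def __init__(self, body, headers=None):
        self.body = body
        self.headers = headers or {}

class ChannelProcessor:
    def __init__(self, interceptors, selector):
        self.interceptors = interceptors  # callables: Event -> Event or None
        self.selector = selector          # callable: Event -> list of channels

    def process_event(self, event):
        # Run the interceptor chain; an interceptor may drop the event.
        for intercept in self.interceptors:
            event = intercept(event)
            if event is None:
                return
        # The channel selector decides which channel(s) receive the event.
        for channel in self.selector(event):
            channel.append(event)

# Usage: one interceptor stamps a header; the selector replicates to both channels.
c1, c2 = [], []
stamp = lambda e: (e.headers.update({"stamped": "yes"}) or e)
proc = ChannelProcessor([stamp], lambda e: [c1, c2])
proc.process_event(Event(b"hello"))
```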
2. channel:
Connects the Source (event producer) and the Sink (event consumer). A channel is essentially a buffer; it supports transactions, guaranteeing atomicity of put and take. A Channel must be thread-safe.
put()
take()
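The channel contract above can be sketched as a thread-safe bounded buffer with put and take (a toy model; Flume's real MemoryChannel additionally wraps these operations in transactions):

```python
import threading
from collections import deque

class MemoryChannel:
    """Thread-safe bounded buffer between a source (put) and a sink (take)."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.queue = deque()
        self.lock = threading.Lock()   # the real channel must be thread-safe

    def put(self, event):
        with self.lock:
            if len(self.queue) >= self.capacity:
                # A full channel rejects the put; the source's transaction rolls back.
                raise RuntimeError("channel full")
            self.queue.append(event)

    def take(self):
        with self.lock:
            return self.queue.popleft() if self.queue else None

ch = MemoryChannel(capacity=2)
ch.put("e1")
ch.put("e2")
```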
3. sink:
Connects to a channel, consumes its events, and sends them to a destination; many sink types are available.
Sinks can be grouped via SinkGroups and a SinkProcessor; a SinkRunner polls the processor to drive the sinks.
A Sink's process() may only be accessed by a single thread.
setChannel()
getChannel()
customSink (a custom sink)
1. Create a MySink class
2. Export a jar and put it on the Flume classpath
3. Configure a custom-s.conf file:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink (the fully qualified class name from our own jar)
a1.sinks.k1.type = com.rxcd.myflume.sink.MySink

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
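The sink contract (take from the channel, deliver, roll back on failure) can be sketched as follows; ToyChannel, sink_process and the batch size are all illustrative, not Flume's Java API:

```python
from collections import deque

class ToyChannel:
    """Stands in for a Flume channel: take() returns None when empty."""
    def __init__(self, events):
        self.q = deque(events)
    def take(self):
        return self.q.popleft() if self.q else None
    def putback(self, event):
        self.q.appendleft(event)

def sink_process(channel, deliver, batch_size=100):
    """One poll of the sink: take a batch inside a 'transaction', deliver it,
    and on failure return the events to the channel (rollback)."""
    batch = []
    try:
        for _ in range(batch_size):
            event = channel.take()
            if event is None:
                break
            batch.append(event)
        if not batch:
            return "BACKOFF"           # nothing available; the SinkRunner backs off
        for event in batch:
            deliver(event)             # e.g. write to console, file, or HDFS
        return "READY"                 # commit: events are consumed
    except Exception:
        for event in reversed(batch):  # rollback: no event is lost
            channel.putback(event)
        raise

out = []
status = sink_process(ToyChannel(["a", "b"]), out.append)
```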
JDBC Channel
This channel persists events in an embedded relational database (Derby). Unlike the memory channel, its contents are not lost when the agent crashes.
a1.channels.c1.type = jdbc
Flume does not ship a MySQL-backed channel, so using MySQL requires a custom channel class, configured along these lines:
a1.channels.c1.type = com.rxcd.myflume.channel.MyChannel
a1.channels.c1.db.type = MYSQL
a1.channels.c1.driver.class = com.mysql.jdbc.Driver
a1.channels.c1.driver.url = jdbc:mysql://192.168.1.106:3306/test
a1.channels.c1.db.user = root
a1.channels.c1.db.password = root
a1.channels.c1.db.schema = true
File Channel
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir =/mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
Spillable Memory Channel
a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000       # max events in the in-memory queue; 0 disables the memory queue
a1.channels.c1.overflowCapacity = 100000    # max events spilled to disk; once reached no more can be stored; 0 disables overflow
a1.channels.c1.byteCapacity = 80000         # capacity in bytes
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
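The spill behavior can be sketched as a two-tier buffer (a toy model; the real channel persists the overflow through the file-channel machinery):

```python
from collections import deque

class SpillableChannel:
    """Fill a bounded in-memory queue first, then overflow to a
    (simulated) disk store up to overflow_capacity."""
    def __init__(self, memory_capacity=3, overflow_capacity=5):
        self.mem = deque()
        self.disk = deque()   # stands in for the file-channel backing store
        self.memory_capacity = memory_capacity
        self.overflow_capacity = overflow_capacity

    def put(self, event):
        if len(self.mem) < self.memory_capacity:
            self.mem.append(event)
        elif len(self.disk) < self.overflow_capacity:
            self.disk.append(event)
        else:
            raise RuntimeError("channel full")

    def take(self):
        # Drain memory first, then the overflow.
        if self.mem:
            return self.mem.popleft()
        if self.disk:
            return self.disk.popleft()
        return None

ch = SpillableChannel(memory_capacity=2, overflow_capacity=2)
for e in ["e1", "e2", "e3", "e4"]:
    ch.put(e)
```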
Replicating Channel Selector (with two clusters, you can configure two channels and deliver events to both)
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3
Multiplexing Channel Selector
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
The multiplexing selector requires an event header: events whose "state" header is CZ go to c1, US goes to c2 and c3, and events without a matching header value go to the default channel c4.
3. Create a header file
header.txt
state=US
4. Start the flume agent
Flume channel selectors:
replicating
multiplexing
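The two selector behaviors can be sketched as pure routing functions (illustrative, not Flume internals):

```python
def replicating(event, channels):
    # Every configured channel receives a copy of the event.
    return list(channels)

def multiplexing(event, mapping, default, header="state"):
    # Route by the value of one header; unmatched values go to the default.
    return mapping.get(event["headers"].get(header), default)

# Mirrors the multiplexing config above: CZ -> c1, US -> c2 c3, default -> c4.
mapping = {"CZ": ["c1"], "US": ["c2", "c3"]}
us_route = multiplexing({"headers": {"state": "US"}}, mapping, ["c4"])
no_header = multiplexing({"headers": {}}, mapping, ["c4"])
```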
sink processor
Sinks can be grouped, which enables load balancing: sinks that do the same job go into one group, and the group balances load across its members.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
The default sink processor handles only a single sink, so for a single sink the user is not forced to create a group or processor.
Failover Sink Processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
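Failover selection can be sketched as "highest-priority live sink wins" (a simplification: the real processor also applies a back-off penalty, capped by maxpenalty, before retrying a failed sink):

```python
def pick_failover(sinks, priorities, failed):
    """Return the highest-priority sink not currently failed, or None."""
    live = [s for s in sinks if s not in failed]
    return max(live, key=lambda s: priorities[s]) if live else None

# Mirrors the config above: k1 has priority 5, k2 has priority 10.
priorities = {"k1": 5, "k2": 10}
first = pick_failover(["k1", "k2"], priorities, failed=set())        # -> "k2"
after_fail = pick_failover(["k1", "k2"], priorities, failed={"k2"})  # -> "k1"
```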
Load Balancing Sink Processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
TimestampInterceptor
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
HostInterceptor (records which host the event came from)
a1.sources = r1
a1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname
Source implementations
AvroSource: receives serialized events over Avro RPC
ExecSource: runs a script or executable command and turns its output into events
NetcatSource: listens on a TCP port, netcat-style (the "Swiss Army knife")
SpoolDirectorySource: watches a spooling directory for new files
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir=/home/ubuntu/a
SequenceGeneratorSource (a sequence-generator source)
a1.sources.r1.type = seq
Syslog TCP source (syslog events can also arrive over UDP with the corresponding source type)
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 8888
a1.sources.r1.host = localhost
Stress source (generates load to benchmark Flume's performance)
a1.sources.r1.type = org.apache.flume.source.StressSource
a1.sources.r1.size = 10240
a1.sources.r1.maxTotalEvents = 1000000
Custom source
Requires the fully qualified class name (package + class name), e.g.:
MySource extends AbstractSource
Sink
HDFS sink
a1.sinks = s1
a1.sinks.s1.type = hdfs
a1.sinks.s1.channel = c1
a1.sinks.s1.hdfs.path =/home/ubuntu/a/%y-%m-%d/%H%M/%s
Hive Sink
Writes are ultimately translated into MapReduce jobs, so YARN must be started.
Interceptors
Some interceptors set event headers; interceptors process events as a batch.
sink processor
load_balance: provides load balancing (the sinks take turns serving)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = random
sink groups
With multiple sinks in a group, several selection strategies are available:
1. round_robin
2. random
Sink groups can also provide a failover mechanism.
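The two load-balancing strategies can be sketched as follows (illustrative; the real processor also supports back-off for failed sinks):

```python
import itertools
import random

def round_robin(sinks):
    """Rotate through the sinks so each takes a turn."""
    return itertools.cycle(sinks)

def pick_random(sinks, rng=random):
    """Pick any sink uniformly at random."""
    return rng.choice(sinks)

rr = round_robin(["k1", "k2"])
order = [next(rr) for _ in range(4)]   # ["k1", "k2", "k1", "k2"]
```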
Flume interceptors
Flume can modify and drop events via interceptors.
Timestamp Interceptor
This interceptor adds to the event's headers (headers can then be used to route and dispatch events).
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
Host Interceptor
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname
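What the two interceptors put into the event headers can be sketched like this (illustrative Python; the real interceptors are Java classes configured as above):

```python
import socket
import time

def timestamp_interceptor(event):
    # Adds the processing time in milliseconds under the "timestamp" header.
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

def host_interceptor(event, host_header="hostname"):
    # Records which host the event passed through, under hostHeader.
    event["headers"][host_header] = socket.gethostname()
    return event

e = host_interceptor(timestamp_interceptor({"headers": {}, "body": b"log"}))
```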
Normally Flume collects data from the local machine.
If the data lives on a remote machine that exposes an access interface, Flume can still collect it: replace the local path with the provided interface.
To make Flume highly available, use the following layered architecture.
Configure Agent1, Agent2 and Agent3 on 192.168.50.100-102 respectively; the configuration is identical on all three machines:
flume-client.properties
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
#set group
agent1.sinkgroups = g1
#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/hadoop/flumetest/dir/logdfs/flumetest.log
agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp
# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = hadoopmaster
agent1.sinks.k1.port = 52020
# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = hadoopslave1
agent1.sinks.k2.port = 52020
#set sink group
agent1.sinkgroups.g1.sinks = k1 k2
#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
Configure Collector1 and Collector2 on 192.168.50.100-101 respectively. Their bound IP (or hostname) differs, so change it to each machine's own IP (or hostname).
flume-server.properties on 192.168.50.100 (hadoopmaster):
#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# other node, nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoopmaster
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = hadoopmaster
a1.sources.r1.channels = c1
#set sink to hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoopmaster:8020/flume/logdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = TEXT
a1.sinks.k1.hdfs.rollInterval = 1
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sinks.k1.hdfs.fileSuffix = .txt
flume-server.properties on 192.168.50.101 (hadoopslave1):
#set Agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# other node, nna to nns
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoopslave1
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
a1.sources.r1.interceptors.i1.value = hadoopslave1
a1.sources.r1.channels = c1
#set sink to hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoopmaster:8020/flume/logdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = TEXT
a1.sinks.k1.hdfs.rollInterval = 1
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sinks.k1.hdfs.fileSuffix = .txt
We layer the Flume agents and connect the layers over Avro, which is what gives us high availability.