I. Review
1. source
2. channel
3. sink
II. Flume: collect the contents of a given file and ship it to HDFS with time-based partitions
1. .tmp files are generated according to the configured roll settings
2. round
Whether to round down the event timestamp when expanding the path. Since we want directories created by time, set this to true.
3. roundValue and roundUnit
How far to round down (e.g. 1 second or 1 minute), set per requirement; the config below rounds to 1 minute.
The conf file exec-memory-hdfs-partition.conf:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/data.log
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop002:9000/data/flume/page_views/%Y%m%d%H%M
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10485760
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = page-views
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 10000
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
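To see what round/roundValue/roundUnit do to the partition directory, here is a minimal, illustrative Java sketch (the class name and timestamps are mine, not from Flume) that floors an event timestamp to the minute the way hdfs.round = true with roundValue = 1, roundUnit = minute does before %Y%m%d%H%M is expanded:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class RoundDemo {
    // Mimics hdfs.round = true, roundValue = 1, roundUnit = minute:
    // the event timestamp is floored to the minute before the escape
    // sequences in hdfs.path are expanded.
    static String partitionDir(long eventTsMillis) {
        long floored = eventTsMillis - (eventTsMillis % 60_000L); // drop seconds/millis
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm"); // %Y%m%d%H%M
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return "/data/flume/page_views/" + fmt.format(new Date(floored));
    }

    public static void main(String[] args) {
        // 10:23:45 and 10:23:59 UTC land in the same one-minute directory
        System.out.println(partitionDir(1622543025000L));
        System.out.println(partitionDir(1622543039000L));
    }
}
```

Because events within the same minute share one directory, the sink keeps writing the same .tmp file until a roll condition (here rollSize = 10 MB) triggers.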
Command:
./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/script/flume/exec-memory-hdfs-partition.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343
III. Multi-Agent Flows
1. One agent's output serves as another agent's input
2. Multiple agents converge on one agent's source, and that agent then writes the data to HDFS
The avro source side specifies which address data is pulled from; the avro sink side specifies which address data is written to.
3. Why multi-tier agent aggregation is needed (think it over)
4. One source fanning out to multiple channels and sinks
client --> source --------------------------> channel --------------------------> sink
                   Flume Channel Selectors              Flume Sink Processors
1) Flume Channel Selectors examples
Replicating Channel Selector:
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3
Multiplexing Channel Selector:
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
# the event header key is "state"; its value selects the channel(s) below
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
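The mapping rules above can be sketched in plain Java to make the routing explicit (the class name is illustrative; the real routing happens inside Flume's multiplexing selector):

```java
import java.util.List;
import java.util.Map;

public class MultiplexingDemo {
    // Mirrors the multiplexing selector config: the value of the
    // "state" header picks the channel list; any other value, or a
    // missing header, falls back to the default channel c4.
    static final Map<String, List<String>> MAPPING = Map.of(
            "CZ", List.of("c1"),
            "US", List.of("c2", "c3"));
    static final List<String> DEFAULT_CHANNELS = List.of("c4");

    static List<String> route(String state) {
        if (state == null) return DEFAULT_CHANNELS;
        return MAPPING.getOrDefault(state, DEFAULT_CHANNELS);
    }

    public static void main(String[] args) {
        System.out.println(route("US")); // [c2, c3]
        System.out.println(route("JP")); // [c4]
    }
}
```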
2) Flume Sink Processors examples
failover: fail over to a standby sink
load_balance: load balancing
Failover Sink Processor:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# the higher the number, the higher the priority
a1.sinkgroups.g1.processor.maxpenalty = 10000
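load_balance is mentioned above without an example; a minimal sketch using the Load balancing Sink Processor's standard properties (the selector may be round_robin or random):

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# temporarily blacklist a sink that failed, with exponential backoff
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
```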
IV. Hands-on Requirement
Requirement: two machines; send data through agent1 to agent2 and print it to the console
agent1: exec source - memory channel - avro sink, avro-sink.conf
agent2: avro source - memory channel - logger sink, avro-source.conf
avro-sink.conf:
avro-sink-agent.sources = exec-source
avro-sink-agent.sinks = avro-sink
avro-sink-agent.channels = avro-memory-channel
avro-sink-agent.sources.exec-source.type = exec
avro-sink-agent.sources.exec-source.command = tail -F /opt/data/avro_access.log
avro-sink-agent.sources.exec-source.channels = avro-memory-channel
avro-sink-agent.channels.avro-memory-channel.type = memory
avro-sink-agent.sinks.avro-sink.type = avro
avro-sink-agent.sinks.avro-sink.channel = avro-memory-channel
avro-sink-agent.sinks.avro-sink.hostname = 0.0.0.0
avro-sink-agent.sinks.avro-sink.port = 44444
avro-source.conf:
avro-source-agent.sources = avro-source
avro-source-agent.sinks = logger-sink
avro-source-agent.channels = avro-memory-channel
avro-source-agent.sources.avro-source.type = avro
avro-source-agent.sources.avro-source.channels = avro-memory-channel
avro-source-agent.sources.avro-source.bind = 0.0.0.0
avro-source-agent.sources.avro-source.port = 44444
avro-source-agent.channels.avro-memory-channel.type = memory
avro-source-agent.sinks.logger-sink.type = logger
avro-source-agent.sinks.logger-sink.channel = avro-memory-channel
Command 1:
flume-ng agent \
--name avro-sink-agent \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/avro-sink.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34344
Command 2:
flume-ng agent \
--name avro-source-agent \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/avro-source.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343
Summary:
1) Start command 2 (the receiving avro-source agent) first, then command 1
2) The two agents use different monitoring ports (34344 vs. 34343)
Requirement:
Set up a Java service in IDEA
1. In IDEA, create a directory in the Scala project and Mark Directory as Sources Root
2. Create a Java class
3. Log an incrementing index
package com.ruozedata.flume;

import org.apache.log4j.Logger;

// We use org.apache.log4j.Logger here because later we will redirect
// the log output to the Linux machine by editing log4j.properties.
public class LoggerGenerator {

    private static Logger logger = Logger.getLogger(LoggerGenerator.class.getName());

    public static void main(String[] args) throws Exception {
        int index = 0;
        while (true) {
            Thread.sleep(1000);
            logger.info("ruozeshuju" + index++);
        }
    }
}
4. Send the logs printed from IDEA to the agent on Linux
Start avro-source.conf so that log events arriving on this machine's port 44444 are printed to the console
5. Log4j Appender
Flume's user guide says: "Appends Log4j events to a flume agent's avro source."
Copy the sample configuration from that section into IDEA's log4j.properties
6. Add the dependency
Flume's user guide notes: "A client using this appender must have the flume-ng-sdk in the classpath (eg, flume-ng-sdk-1.6.0.jar)."
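As a sketch of that dependency in Maven form (versions are assumptions and should match your Flume installation; the Log4jAppender class itself ships in the flume-ng-log4jappender artifact):

```xml
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-sdk</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flume.flume-ng-clients</groupId>
    <artifactId>flume-ng-log4jappender</artifactId>
    <version>1.6.0</version>
</dependency>
```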
7. Edit the log4j.properties file
log4j.rootCategory=INFO, console, flume
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = hadoop002
log4j.appender.flume.Port = 44444
log4j.appender.flume.UnsafeMode = true
log4j.appender.flume.layout = org.apache.log4j.PatternLayout
Note: this file only takes effect if it lives in a dedicated resources folder that has been marked via Mark Directory as Resources Root
8. Start avro-source.conf, then run the code
On the server, the generated log lines should show up on the agent's console.
V. Flume Interceptors
client --> source --------------------------> channel --------------------------> sink
                   Flume Channel Selectors              Flume Sink Processors
Two commonly used interceptors
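The notes stop before naming them; which two the author intended is an assumption, but a frequently used pair is the timestamp and host interceptors. A sketch, reusing the a1/r1 names from the earlier examples:

```properties
a1.sources.r1.interceptors = i1 i2
# adds a "timestamp" header, so the HDFS sink can expand %Y%m%d%H%M
# from the event itself instead of relying on useLocalTimeStamp
a1.sources.r1.interceptors.i1.type = timestamp
# adds the agent host's address under the "hostname" header
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname
```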