Two Flume Use Cases
Case 1: Real-Time Collection of Directory Files into HDFS
-
Requirement: use flume to monitor all files in a directory
-
Requirement analysis:
- Files are added to a designated directory
- Flume monitors that directory; files ending in .tmp are not uploaded, and every file that has been collected is renamed to end in .COMPLETED
- The collected data is uploaded to HDFS
-
Implementation:
-
Create the configuration file f-dir-hdfs.conf and write the following content:
[kgg@hadoop201 ~]$ cd /opt/module/flume/
[kgg@hadoop201 flume]$ vi job/f-dir-hdfs.conf
[kgg@hadoop201 flume]$ cat job/f-dir-hdfs.conf
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/datas/log
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) any file ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop201:9000/flume/logs/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = log-
# Roll directories based on time
a3.sinks.k3.hdfs.round = true
# Number of time units per new directory
a3.sinks.k3.hdfs.roundValue = 1
# The time unit itself
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file after this many seconds
a3.sinks.k3.hdfs.rollInterval = 600
# Roll each file at roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
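A note on the ignorePattern line: it is a regex matched against each file name, and a name that matches is skipped by the spooldir source. The behavior can be sketched with grep (the `^…$` anchoring below is my assumption, to mimic a full-name match):

```shell
# Mimic ignorePattern = ([^ ]*\.tmp): a matching file name is skipped.
pattern='^([^ ]*\.tmp)$'
for f in a.log b.log c.tmp; do
  if echo "$f" | grep -qE "$pattern"; then
    echo "$f: ignored"      # only c.tmp lands here
  else
    echo "$f: collected"
  fi
done
```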
-
Create the specified datas/log directory and put a few files in it, one with a .tmp suffix, which should not be collected:
[kgg@hadoop201 datas]$ mkdir log
[kgg@hadoop201 datas]$ cd log/
[kgg@hadoop201 log]$ echo 'aaaa' > a.log
[kgg@hadoop201 log]$ echo 'bbbb' > b.log
[kgg@hadoop201 log]$ echo 'cccc' > c.tmp
[kgg@hadoop201 log]$ ls
a.log  b.log  c.tmp
-
Start flume:
[kgg@hadoop201 flume]$ bin/flume-ng agent -f job/f-dir-hdfs.conf -n a3
-
Watch the log; on first startup it reports creating the directory in HDFS:
20/10/06 20:04:12 INFO hdfs.BucketWriter: Creating hdfs://hadoop201:9000/flume/logs/20201006/20/log-.1601985851962.tmp
-
Check the datas directory; the file suffixes have changed:
[kgg@hadoop201 log]$ ls
a.log.COMPLETED  b.log.COMPLETED  c.tmp
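The suffix change is just an in-place rename: once a file is fully consumed, the source appends the configured fileSuffix. Re-enacted with plain shell in a scratch directory:

```shell
# Re-enact the post-collection rename in a throwaway directory
dir=$(mktemp -d)
echo 'aaaa' > "$dir/a.log"
mv "$dir/a.log" "$dir/a.log.COMPLETED"   # what fileSuffix = .COMPLETED produces
ls "$dir"                                # -> a.log.COMPLETED
```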
-
Rename c.tmp to c.log and observe the result:
[kgg@hadoop201 log]$ mv c.tmp c.log
[kgg@hadoop201 log]$ ls
a.log.COMPLETED  b.log.COMPLETED  c.log.COMPLETED
-
c.log was collected successfully and the log shows the expected messages; the case is complete:
20/10/06 20:06:18 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
20/10/06 20:06:18 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /opt/module/datas/log/c.log to /opt/module/datas/log/c.log.COMPLETED

## Check the data in HDFS
[kgg@hadoop201 log]$ hadoop fs -cat /flume/logs/20201006/20/log-.1601985851962.tmp
aaaa
bbbb
cccc
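A note on rollSize = 134217700 in the config above: it sits just below a 128 MB HDFS block (a common default block size), so each rolled file fits inside a single block. A quick arithmetic check:

```shell
block=$((128 * 1024 * 1024))    # 128 MB HDFS block size in bytes
roll=134217700                  # rollSize from f-dir-hdfs.conf
echo "block=$block roll=$roll headroom=$((block - roll))"
# -> block=134217728 roll=134217700 headroom=28
```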
-
Case 2: Single Data Source, Multiple Outputs
-
Requirements:
- flume1 monitors changes to a file
- sink1 stores the content in HDFS
- sink2 sends the content to flume2
- flume2 stores the content on the local filesystem
-
Analysis:
- flume1 is deployed on the hadoop201 node
- flume2 is deployed on the hadoop202 node
- flume2 receives data from flume1, so flume2 must be started first: its avro source has to be listening before flume1's avro sink tries to connect
-
Preparation:
## Clear out the log directory and create a new empty file a.log
[kgg@hadoop201 datas]$ cd log/
[kgg@hadoop201 log]$ rm -rf *
[kgg@hadoop201 log]$ touch a.log
[kgg@hadoop201 log]$ ls
a.log
-
Create flume1's conf file and write the following (note: the original had a typo, a1.source.r1.selector.type, corrected here to a1.sources.r1):
[kgg@hadoop201 group]$ vi flume1.conf
[kgg@hadoop201 group]$ cat flume1.conf
# name
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# copy all the datas to all the channels
a1.sources.r1.selector.type = replicating

# source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/datas/log/a.log
a1.sources.r1.shell = /bin/bash -c

# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop201:9000/flume/logs1/%Y%m%d/%H
a1.sinks.k1.hdfs.filePrefix = logs
# Roll directories based on time
a1.sinks.k1.hdfs.round = true
# Number of time units per new directory
a1.sinks.k1.hdfs.roundValue = 1
# The time unit itself
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file after this many seconds
a1.sinks.k1.hdfs.rollInterval = 600
# Roll each file at roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Minimum block replicas
a1.sinks.k1.hdfs.minBlockReplicas = 1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop202
a1.sinks.k2.port = 4141

# bind them
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
-
On hadoop202, create flume2.conf and write the configuration (this version contains a mistake that is discovered and fixed below):
[kgg@hadoop202 group]$ vi flume2.conf
[kgg@hadoop202 group]$ cat flume2.conf
# name
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop202
a2.sources.r1.port = 4141

# channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# sink
a2.sinks.k1.type = logger

# bind them
a2.source.channels = c1
a2.sinks.channel = c1
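The requirement says flume2 should store events on the local filesystem, while the configuration above uses a logger sink, which only prints events and makes verification easy. As a sketch, Flume's file_roll sink would satisfy the requirement literally; the output directory here is my assumption and must exist before startup:

```
# Hypothetical alternative sink: roll events into local files
a2.sinks.k1.type = file_roll
a2.sinks.k1.sink.directory = /opt/module/datas/flume2-out
# roll to a new file every 600 seconds, matching the HDFS sink in flume1
a2.sinks.k1.sink.rollInterval = 600
```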
-
As analyzed above, start flume2 first, then flume1:
[kgg@hadoop202 flume]$ bin/flume-ng agent -f job/group/flume2.conf -n a2
[kgg@hadoop201 flume]$ bin/flume-ng agent -f job/group/flume1.conf -n a1
# An exception occurred:
Caused by: java.net.ConnectException: Connection refused: hadoop201/192.168.1.201:4141
After repeated checking, flume2's startup log turned out to be abnormal; the root cause was an error in the bind section of flume2's configuration:
# The original, incorrect part:
# bind them
a2.source.channels = c1
a2.sinks.channel = c1

# Corrected to:
# bind them
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
A careless mistake like this really shouldn't happen.
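Typos of this class, a missing component name between the role and the property, can be caught before startup with a quick grep; the pattern below is an ad-hoc heuristic of mine, not a Flume tool:

```shell
# Flag bind lines that lack a component name such as r1 or k1;
# correct lines look like a2.sources.r1.channels / a2.sinks.k1.channel
cat > /tmp/flume2-check.conf <<'EOF'
a2.source.channels = c1
a2.sinks.channel = c1
EOF
grep -cE '^[A-Za-z0-9]+\.(source|sources|sink|sinks)\.(channels|channel) *=' \
  /tmp/flume2-check.conf   # -> 2 (both bad lines are flagged)
```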
After the fix, both agents started successfully. Starting flume2 first and then flume1, the connection logs appear:
20/10/06 21:00:56 INFO ipc.NettyServer: [id: 0x2b798040, /192.168.1.201:59467 => /192.168.1.202:4141] OPEN
20/10/06 21:00:56 INFO ipc.NettyServer: [id: 0x2b798040, /192.168.1.201:59467 => /192.168.1.202:4141] BOUND: /192.168.1.202:4141
20/10/06 21:00:56 INFO ipc.NettyServer: [id: 0x2b798040, /192.168.1.201:59467 => /192.168.1.202:4141] CONNECTED: /192.168.1.201:59467
Summary
When writing flume conf files, copy an existing one and modify it whenever possible; copying reduces the chance of typos. To get familiar with the conf structure I typed most of the content by hand, which introduced many errors that were hard to track down.