Learning and Using Flume
1. Introduction to Flume
1.1 What Flume Does
Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. Its most common job is to read data from a server's local disk and write it to HDFS.
1.2 Flume's Basic Components
(1) Agent
An Agent is a JVM process that moves data from a source to a destination in the form of events.
An Agent consists of three main parts: Source, Channel, and Sink.
(2) Source
The Source is the component that receives data into the Flume Agent: it collects data and wraps it into Events. Sources can handle log data of many types and formats.
Commonly used Sources:
netcat: listens on a port
spooldir: watches a directory and reads newly added files
taildir: watches both directories and files; supported since Flume 1.7
avro: listens on an Avro port; lets multiple Agents run in parallel and chain together
(3) Channel
The Channel is a buffer between the Source and the Sink, which allows them to run at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
(4) Sink
The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or forwards them to another Flume Agent.
Commonly used Sinks:
logger: prints to the console, mostly used for testing
hdfs: writes to HDFS
kafka: feeds real-time pipelines
avro: when events must be delivered to another Agent's Avro Source, the sink must be an Avro Sink
hive: writes into a Hive table
hbase: writes into an HBase table
2. Using Flume
2.1 Source Components
2.1.1 netcat: Monitoring Port Data
source: netcat  channel: memory  sink: logger
Install the netcat tool:
yum install -y nc
Configuration file:
# component declarations
a1.sources=s1
a1.channels=c1
a1.sinks=k1
# configure the source
a1.sources.s1.type=netcat
a1.sources.s1.bind=192.168.64.180
a1.sources.s1.port=6666
# configure the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=100
a1.channels.c1.transactionCapacity=10
# configure the sink
a1.sinks.k1.type=logger
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent and listen:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume01.conf -Dflume.root.logger=INFO,console
# -c takes Flume's own conf directory
# -f takes the agent configuration file you wrote
Send data to the port with netcat:
nc -v 192.168.64.180 6666
2.1.2 spooldir: Monitoring a Directory
Configuration file:
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# configure the source
a1.sources.s1.type=spooldir
# the watched directory (a properties file does not allow inline comments after a value)
a1.sources.s1.spoolDir=/root/data/flume
# processed files get the .bak suffix; ignorePattern keeps them from being re-read
a1.sources.s1.ignorePattern=^(.)*\\.bak$
a1.sources.s1.fileSuffix=.bak
# configure the channel
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.c1.capacity=100000
a1.channels.c1.transactionCapacity=10000
# configure the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/fakeorder/%Y-%m-%d/%H
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.batchSize=1000
a1.sinks.k1.hdfs.threadsPoolSize=4
a1.sinks.k1.hdfs.idleTimeout=0
a1.sinks.k1.hdfs.minBlockReplicas=1
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume02.conf -Dflume.root.logger=INFO,console
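One spooldir caveat worth noting: files must be complete and immutable by the time they land in the watched directory, otherwise the source can fail or pick up partial data. A minimal shell sketch of dropping a file in safely (SPOOL_DIR is a stand-in for the /root/data/flume directory above):

```shell
# Write the file elsewhere first, then move it into the watched directory
# in one atomic step so spooldir never sees a half-written file.
SPOOL_DIR="${SPOOL_DIR:-$(mktemp -d)}"   # substitute /root/data/flume
TMP=$(mktemp)
printf '1,tom,11,male\n2,amy,12,female\n' > "$TMP"
mv "$TMP" "$SPOOL_DIR/order_$(date +%s).log"
ls "$SPOOL_DIR"
```

mv within the same filesystem is atomic, which is why the write-then-rename pattern is safe here.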
2.1.3 taildir: Monitoring Directories and Files
Configuration file:
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# configure the source
a1.sources.s1.type=taildir
# multiple file groups can be listed, separated by spaces
a1.sources.s1.filegroups=f1 f2
# each group can match individual files by regex
a1.sources.s1.filegroups.f1=/root/data/flume/log01/.*\\.log
a1.sources.s1.filegroups.f2=/root/data/flume/log02/.*\\.log
a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.conf
# configure the channel
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=1000
# configure the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/tailevent/%Y-%m-%d/%H
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.batchSize=1000
a1.sinks.k1.hdfs.threadsPoolSize=4
a1.sinks.k1.hdfs.idleTimeout=0
a1.sinks.k1.hdfs.minBlockReplicas=1
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume04.conf -Dflume.root.logger=INFO,console
2.1.4 avro: Listening on an Avro Port
Configuration file:
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# configure the source
a1.sources.s1.type=avro
a1.sources.s1.bind=192.168.64.180
a1.sources.s1.port=7777
a1.sources.s1.threads=5
# configure the channel
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.c1.capacity=100000
a1.channels.c1.transactionCapacity=10000
# configure the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/avroevent/%Y-%m-%d/%H
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.batchSize=100
a1.sinks.k1.hdfs.threadsPoolSize=4
a1.sinks.k1.hdfs.idleTimeout=0
a1.sinks.k1.hdfs.minBlockReplicas=1
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume03.conf -Dflume.root.logger=INFO,console
Then send a file with the avro client:
flume-ng avro-client -H 192.168.64.180 -p 7777 -c /opt/software/flume/flume190/conf/ -F /root/data/flume/prolog.log.bak
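In practice an avro source is usually fed not by the avro client but by another agent's avro sink, which is how multi-agent chains are built. A sketch of the matching sink on an upstream agent (the agent name a2 is a placeholder; hostname and port mirror the source above):

```properties
# upstream agent: forward events to the avro source at 192.168.64.180:7777
a2.sinks.k1.type=avro
a2.sinks.k1.hostname=192.168.64.180
a2.sinks.k1.port=7777
a2.sinks.k1.channel=c1
```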
2.2 Sink Components
2.2.1 hive sink: Writing Data to Hive
Requirements on the Hive table:
1. It must be a partitioned table
2. It must be bucketed
3. It must be stored as ORC
Preparation before running:
1. Check that the metastore service is listening
netstat -ntl | grep 9083
2. Enable Hive transaction support
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
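These SET statements only last for the current Hive session. To make transaction support permanent, the same properties can be put into hive-site.xml; a partial sketch:

```xml
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<!-- the remaining settings (hive.enforce.bucketing, hive.exec.dynamic.partition.mode,
     hive.compactor.initiator.on, hive.compactor.worker.threads) follow the same pattern -->
```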
3. Copy the Hive HCatalog jars that Flume depends on
cp /opt/software/hive/hive312/hcatalog/share/hcatalog/*.jar /opt/software/flume/flume190/lib/
4. Create the Hive table
create table familyinfo(
id int,
name string,
age int,
gender string
)
partitioned by (intime string)
clustered by(gender) into 2 buckets
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as orc
tblproperties('transactional'='true');
5. Manually add a partition for the current date and hour
alter table familyinfo add partition(intime= '21-07-05-16');
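The partition value must line up with the %y-%m-%d-%H pattern that the sink's hive.partition setting uses, so rather than typing it by hand the statement can be generated; a small shell sketch:

```shell
# Build the intime value in the same %y-%m-%d-%H layout used by hive.partition
PART=$(date +%y-%m-%d-%H)
echo "alter table familyinfo add partition(intime='${PART}');"
```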
Configuration file:
Note: running this agent kept failing with a checkpoint error, so the c1 name was dropped from the channel properties below. Flume silently ignores channel properties that are not addressed to a named channel (e.g. a1.channels.checkpointDir), so the file channel actually runs with its default settings, which is why the error no longer appears.
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#taildir source
a1.sources.s1.type=taildir
a1.sources.s1.filegroups=f1
a1.sources.s1.filegroups.f1=/root/data/flume/log01/.*.log
a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.json
a1.sources.s1.batchSize=10
#file channel
a1.channels.c1.type=file
a1.channels.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.capacity=100
a1.channels.transactionCapacity=10
#hive sink
a1.sinks.k1.type=hive
a1.sinks.k1.hive.metastore=thrift://192.168.64.180:9083
a1.sinks.k1.hive.database=test
a1.sinks.k1.hive.table=familyinfo
a1.sinks.k1.hive.partition=%y-%m-%d-%H
a1.sinks.k1.useLocalTimeStamp=true
a1.sinks.k1.autoCreatePartitions=false
a1.sinks.k1.round=true
a1.sinks.k1.batchSize=10
a1.sinks.k1.roundValue=10
a1.sinks.k1.roundUnit=minute
a1.sinks.k1.serializer=DELIMITED
a1.sinks.k1.serializer.delimiter=,
a1.sinks.k1.serializer.serdeSeparator=','
a1.sinks.k1.serializer.fieldnames=id,name,age,gender
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume05.cnf -Dflume.root.logger=INFO,console
2.2.2 hbase sink: Writing Data to HBase
Configuration file:
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#taildir source
a1.sources.s1.type=taildir
a1.sources.s1.filegroups=f1
a1.sources.s1.filegroups.f1=/root/data/flume/log2/.*.log
a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.json
a1.sources.s1.batchSize=10
#file channel
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.c1.capacity=100
a1.channels.c1.transactionCapacity=10
#hbase sink
a1.sinks.k1.type=hbase2
a1.sinks.k1.table = test:studentinfo
a1.sinks.k1.columnFamily = base
a1.sinks.k1.serializer.regex = (.*),(.*),(.*),(.*)
a1.sinks.k1.serializer = org.apache.flume.sink.hbase2.RegexHBase2EventSerializer
a1.sinks.k1.serializer.colNames = ROW_KEY,name,age,gender
a1.sinks.k1.serializer.rowKeyIndex = 0
a1.sinks.k1.batchSize = 10
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume06.cnf -Dflume.root.logger=INFO,console
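The hbase2 sink assumes the namespace, table, and column family already exist; it will not create them. A sketch of preparing them in the HBase shell (names match the config above):

```
create_namespace 'test'
create 'test:studentinfo', 'base'
```

Each input line must then match the serializer regex, i.e. four comma-separated fields with the row key first, e.g. s001,tom,11,male.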