Learning and Using Flume

1. Introduction to Flume

1.1 The Role of Flume

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and moving large volumes of log data. Its most common use is to read data from a server's local disk and write it to HDFS.

1.2 Basic Components of Flume


(1) Agent

An Agent is a JVM process that moves data from a source to a destination in the form of events.
An Agent has three main parts: the Source, the Channel, and the Sink.

(2) Source

The Source is the component that receives data into the Flume Agent: it collects the data and wraps it into Events. A Source can handle log data of many types and formats.
Commonly used Source types:
netcat: listens on a TCP port
spooldir: watches a directory and reads any newly added files
taildir: watches multiple directories and files at the same time; available since Flume 1.7
avro: listens on an Avro port, so multiple Agents can run in parallel and feed into it

(3) Channel

The Channel is the buffer that sits between the Source and the Sink, which allows the Source and the Sink to run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
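
The configurations later in this article use two channel types: the memory channel (fast, but buffered events are lost if the agent dies) and the file channel (slower, but events survive a restart because they are spooled to disk). A minimal sketch of each; the capacities and paths below are only placeholders:

# memory channel: capacity = max events buffered, transactionCapacity = max events per put/take transaction
a1.channels.c1.type=memory
a1.channels.c1.capacity=100
a1.channels.c1.transactionCapacity=10

# file channel: events and a checkpoint are written to local disk
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data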

(4) Sink

The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or forwards them to another Flume Agent.
Commonly used Sink types:
logger: prints events to the console, mostly used for testing
hdfs: writes events to HDFS
kafka: feeds events into Kafka for real-time processing
avro: when events need to be forwarded to another Agent's Avro Source, the Sink must be an Avro Sink
hive: writes events into a Hive table
HBase: writes events into an HBase table

2. Using Flume

2.1 Source-Side Components
2.1.1 netcat: Monitoring Port Data

Source: netcat; Channel: memory; Sink: logger

Install the netcat tool:

yum install -y nc
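
A quick way to sanity-check netcat itself before wiring it into Flume (port 7777 here is just an example) is to open a listener in one terminal and a client in another:

nc -lk 7777          # terminal 1: listen on port 7777 and keep accepting connections
nc localhost 7777    # terminal 2: connect; lines typed here show up in terminal 1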

Configuration file:

# declare components
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# source: netcat
a1.sources.s1.type=netcat
a1.sources.s1.bind=192.168.64.180
a1.sources.s1.port=6666 

# channel: memory
a1.channels.c1.type=memory
a1.channels.c1.capacity=100
a1.channels.c1.transactionCapacity=10

# sink: logger
a1.sinks.k1.type=logger

# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

Start the agent:

flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume01.conf -Dflume.root.logger=INFO,console
# -c points to Flume's own conf directory
# -f points to the agent configuration file written above

Use netcat to send data to the port:

nc -v 192.168.64.180 6666
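
Besides typing lines interactively, data can also be piped into nc; each line sent becomes one event printed by the logger sink on the agent's console (the message below is arbitrary):

echo "hello flume" | nc 192.168.64.180 6666
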
2.1.2 spooldir: Monitoring a Directory

Configuration file:

	# declare components
	a1.sources = s1
	a1.channels = c1
	a1.sinks = k1
	# source: spooldir
	a1.sources.s1.type=spooldir
	# the directory being watched
	a1.sources.s1.spoolDir=/root/data/flume
	a1.sources.s1.ignorePattern=^(.)*\\.bak$
	a1.sources.s1.fileSuffix=.bak
	# channel: file
	a1.channels.c1.type=file
	a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
	a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
	a1.channels.c1.capacity=100000
	a1.channels.c1.transactionCapacity=10000
	# sink: hdfs
	a1.sinks.k1.type=hdfs
	a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/fakeorder/%Y-%m-%d/%H
	a1.sinks.k1.hdfs.round=true
	a1.sinks.k1.hdfs.roundValue=10
	a1.sinks.k1.hdfs.roundUnit=minute
	a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
	a1.sinks.k1.hdfs.fileSuffix=.log
	a1.sinks.k1.hdfs.useLocalTimeStamp=true
	a1.sinks.k1.hdfs.writeFormat=Text
	a1.sinks.k1.hdfs.rollCount=0
	a1.sinks.k1.hdfs.rollSize=134217728
	a1.sinks.k1.hdfs.rollInterval=0
	a1.sinks.k1.hdfs.batchSize=1000
	a1.sinks.k1.hdfs.threadsPoolSize=4
	a1.sinks.k1.hdfs.idleTimeout=0
	a1.sinks.k1.hdfs.minBlockReplicas=1
	# bind source and sink to the channel
	a1.sources.s1.channels=c1
	a1.sinks.k1.channel=c1

Start the agent:

flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume02.conf -Dflume.root.logger=INFO,console
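
To test it, drop a file into the watched directory; once Flume has ingested the file it is renamed with the .bak suffix configured above, and files matching the ignorePattern are skipped on later scans (the file name below is just an example):

cp /etc/profile /root/data/flume/sample01.log
ls /root/data/flume        # sample01.log is renamed to sample01.log.bak after ingestion
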
2.1.3 taildir: Monitoring Directories and Files Simultaneously

Configuration file:

		# declare components
		a1.sources = s1
		a1.channels = c1
		a1.sinks = k1
		# source: taildir
		a1.sources.s1.type=taildir
		# multiple file groups can be watched at the same time
		a1.sources.s1.filegroups=f1 f2
		# each group is a path regex, so individual files can be targeted
		a1.sources.s1.filegroups.f1=/root/data/flume/log01/.*\\.log
		a1.sources.s1.filegroups.f2=/root/data/flume/log02/.*\\.log
		a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.conf
		# channel: file
		a1.channels.c1.type=file
		a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
		a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
		a1.channels.c1.capacity=10000
		a1.channels.c1.transactionCapacity=1000
		# sink: hdfs
		a1.sinks.k1.type=hdfs
		a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/tailevent/%Y-%m-%d/%H
		a1.sinks.k1.hdfs.round=true
		a1.sinks.k1.hdfs.roundValue=10
		a1.sinks.k1.hdfs.roundUnit=minute
		a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
		a1.sinks.k1.hdfs.fileSuffix=.log
		a1.sinks.k1.hdfs.useLocalTimeStamp=true
		a1.sinks.k1.hdfs.writeFormat=Text
		a1.sinks.k1.hdfs.rollCount=0
		a1.sinks.k1.hdfs.rollSize=134217728
		a1.sinks.k1.hdfs.rollInterval=0
		a1.sinks.k1.hdfs.batchSize=1000
		a1.sinks.k1.hdfs.threadsPoolSize=4
		a1.sinks.k1.hdfs.idleTimeout=0
		a1.sinks.k1.hdfs.minBlockReplicas=1
		# bind source and sink to the channel
		a1.sources.s1.channels=c1
		a1.sinks.k1.channel=c1

Start the agent:

flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume04.conf -Dflume.root.logger=INFO,console
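
To test it, append lines to a file matching one of the file groups; taildir picks up only the appended data and records its read offsets in the position file (the file name and content below are illustrative):

echo "new order line" >> /root/data/flume/log01/app.log
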
2.1.4 avro: Listening on an Avro Port

Configuration file:

		# declare components
		a1.sources = s1
		a1.channels = c1
		a1.sinks = k1
		# source: avro
		a1.sources.s1.type=avro
		a1.sources.s1.bind=192.168.64.180
		a1.sources.s1.port=7777
		a1.sources.s1.threads=5
		# channel: file
		a1.channels.c1.type=file
		a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
		a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
		a1.channels.c1.capacity=100000
		a1.channels.c1.transactionCapacity=10000
		# sink: hdfs
		a1.sinks.k1.type=hdfs
		a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/avroevent/%Y-%m-%d/%H
		a1.sinks.k1.hdfs.round=true
		a1.sinks.k1.hdfs.roundValue=10
		a1.sinks.k1.hdfs.roundUnit=minute
		a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
		a1.sinks.k1.hdfs.fileSuffix=.log
		a1.sinks.k1.hdfs.useLocalTimeStamp=true
		a1.sinks.k1.hdfs.writeFormat=Text
		a1.sinks.k1.hdfs.rollCount=0
		a1.sinks.k1.hdfs.rollSize=134217728
		a1.sinks.k1.hdfs.rollInterval=0
		a1.sinks.k1.hdfs.batchSize=100
		a1.sinks.k1.hdfs.threadsPoolSize=4
		a1.sinks.k1.hdfs.idleTimeout=0
		a1.sinks.k1.hdfs.minBlockReplicas=1
		# bind source and sink to the channel
		a1.sources.s1.channels=c1
		a1.sinks.k1.channel=c1

Start the agent:

flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume03.conf -Dflume.root.logger=INFO,console

Run the Avro client to send data:

flume-ng avro-client -H 192.168.64.180 -p 7777 -c /opt/software/flume/flume190/conf/  -F /root/data/flume/prolog.log.bak
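
The events sent by the Avro client should then show up as files under the HDFS path configured in the sink; this can be checked with:

hdfs dfs -ls -R /flume/events/avroevent
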
2.2 Sink-Side Components
2.2.1 Hive Sink: Writing Data to Hive

Requirements on the Hive table structure:
1. The table must be partitioned
2. The table must be bucketed (clustered)
3. The table must be stored as ORC

Preparation before running:
1. Confirm that the Hive metastore service is running

	netstat -nlt | grep 9083

2. Enable Hive transaction support

		SET hive.support.concurrency = true;
		SET hive.enforce.bucketing = true;
		SET hive.exec.dynamic.partition.mode = nonstrict;
		SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
		SET hive.compactor.initiator.on = true;
		SET hive.compactor.worker.threads = 1;

3. Provide the Hive HCatalog jars that Flume depends on

cp /opt/software/hive/hive312/hcatalog/share/hcatalog/*.jar /opt/software/flume/flume190/lib/

4. Create the Hive table

		create table familyinfo(
		id int,
		name string,
		age int,
		gender string
		)
		partitioned by (intime string)
		clustered by(gender) into 2 buckets
		row format delimited
		fields terminated by ','
		lines terminated by '\n'
		stored as orc
		tblproperties('transactional'='true');

5. Manually add a partition for the current date and hour

alter table familyinfo add partition(intime= '21-07-05-16');
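
The partition value follows the %y-%m-%d-%H pattern used by the Hive sink configuration below (two-digit year, month, day, and hour). The existing partitions can be checked in the Hive CLI (run against the test database):

show partitions familyinfo;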

Configuration file:

Note: the agent kept reporting checkpoint errors at startup, so the channel name c1 was dropped from the channel properties below.
		# declare components
		a1.sources = s1
		a1.channels = c1
		a1.sinks = k1
		#taildir source
		a1.sources.s1.type=taildir
		a1.sources.s1.filegroups=f1
		a1.sources.s1.filegroups.f1=/root/data/flume/log01/.*.log
		a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.json
		a1.sources.s1.batchSize=10

		#file channel 
		a1.channels.c1.type=file
		a1.channels.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
		a1.channels.dataDirs=/opt/software/flume/flume190/mydata/data
		a1.channels.capacity=100
		a1.channels.transactionCapacity=10
		#hive sink
		a1.sinks.k1.type=hive
		a1.sinks.k1.hive.metastore=thrift://192.168.64.180:9083
		a1.sinks.k1.hive.database=test
		a1.sinks.k1.hive.table=familyinfo
		a1.sinks.k1.hive.partition=%y-%m-%d-%H
		a1.sinks.k1.useLocalTimeStamp=true
		a1.sinks.k1.autoCreatePartitions=false
		a1.sinks.k1.round=true
		a1.sinks.k1.batchSize=10
		a1.sinks.k1.roundValue=10
		a1.sinks.k1.roundUnit=minute
		a1.sinks.k1.serializer=DELIMITED
		a1.sinks.k1.serializer.delimited=','
		a1.sinks.k1.serializer.serdeSeparator=','
		a1.sinks.k1.serializer.fieldnames=id,name,age,gender
		# bind source and sink to the channel
		a1.sources.s1.channels=c1
		a1.sinks.k1.channel=c1

Start the agent:

flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume05.cnf -Dflume.root.logger=INFO,console
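
To test the pipeline, append a comma-delimited record whose fields match serializer.fieldnames to a file under the monitored directory (the record and file name below are only examples):

echo "1,tom,21,male" >> /root/data/flume/log01/test.log

Then query the table in the Hive CLI or beeline:

select * from test.familyinfo;
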
2.2.2 HBase Sink: Writing Data to HBase

Configuration file:

		# declare components
		a1.sources = s1
		a1.channels = c1
		a1.sinks = k1
		#taildir source
		a1.sources.s1.type=taildir
		a1.sources.s1.filegroups=f1
		a1.sources.s1.filegroups.f1=/root/data/flume/log2/.*.log
		a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.json
		a1.sources.s1.batchSize=10

		#file channel 
		a1.channels.c1.type=file
		a1.channels.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
		a1.channels.dataDirs=/opt/software/flume/flume190/mydata/data
		a1.channels.capacity=100
		a1.channels.transactionCapacity=10
		#hbase sink
		a1.sinks.k1.type=hbase2
		a1.sinks.k1.table = test:studentinfo
		a1.sinks.k1.columnFamily = base
		a1.sinks.k1.serializer.regex = (.*),(.*),(.*),(.*)
		a1.sinks.k1.serializer = org.apache.flume.sink.hbase2.RegexHBase2EventSerializer
		a1.sinks.k1.serializer.colNames = ROW_KEY,name,age,gender
		a1.sinks.k1.serializer.rowKeyIndex = 0
		a1.sinks.k1.batchSize = 10

		# bind source and sink to the channel
		a1.sources.s1.channels=c1
		a1.sinks.k1.channel=c1
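
Before starting the agent, the target namespace and table must already exist in HBase; a minimal sketch in the hbase shell (assuming the test namespace has not been created yet):

create_namespace 'test'
create 'test:studentinfo', 'base'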

Start the agent:

flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume06.cnf -Dflume.root.logger=INFO,console
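
To verify the pipeline, append a record matching the serializer regex to a file under the monitored directory (the record and file name below are only examples); the first field becomes the row key because rowKeyIndex is 0:

echo "s001,tom,21,male" >> /root/data/flume/log2/app.log

Then check the table in the hbase shell:

scan 'test:studentinfo'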