Learning and Using Flume
1. Introduction to Flume
1.1 What Flume Does
Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. Its most common job is to read data from a server's local disk and write it to HDFS.
1.2 Flume's Basic Components
(1) Agent
An Agent is a JVM process that moves data from a source to a destination in the form of events.
An Agent consists of three main parts: Source, Channel, and Sink.
(2) Source
The Source is the component that receives data into the Flume Agent: it collects data and wraps it into Events. Sources can handle log data of many types and formats.
Commonly used Sources:
netcat: listens on a port
spooldir: watches a directory and reads newly added files
taildir: watches both directories and files; supported since Flume 1.7
avro: listens on an Avro port; lets multiple Agents run in parallel and chain together
(3) Channel
The Channel is a buffer between the Source and the Sink, which allows them to run at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
(4) Sink
The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or forwards them to another Flume Agent.
Commonly used Sinks:
logger: prints to the console, mostly used for testing
hdfs: writes to HDFS
kafka: feeds real-time pipelines
avro: when events must be delivered to another Agent's Avro Source, the sink must be an Avro Sink
hive: writes into a Hive table
hbase: writes into an HBase table
2. Using Flume
2.1 Source Components
2.1.1 netcat: Monitoring Port Data
source: netcat  channel: memory  sink: logger
Install the netcat tool:
yum install -y nc
Configuration file:
# component declarations
a1.sources=s1
a1.channels=c1
a1.sinks=k1
# configure the source
a1.sources.s1.type=netcat
a1.sources.s1.bind=192.168.64.180
a1.sources.s1.port=6666
# configure the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=100
a1.channels.c1.transactionCapacity=10
# configure the sink
a1.sinks.k1.type=logger
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent and listen:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume01.conf -Dflume.root.logger=INFO,console
# -c takes Flume's own conf directory
# -f takes the agent configuration file you wrote
Send data to the port with netcat:
nc -v 192.168.64.180 6666
2.1.2 spooldir: Monitoring a Directory
Configuration file:
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# configure the source
a1.sources.s1.type=spooldir
# the watched directory (a properties file does not allow inline comments after a value)
a1.sources.s1.spoolDir=/root/data/flume
# processed files get the .bak suffix; ignorePattern keeps them from being re-read
a1.sources.s1.ignorePattern=^(.)*\\.bak$
a1.sources.s1.fileSuffix=.bak
# configure the channel
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.c1.capacity=100000
a1.channels.c1.transactionCapacity=10000
# configure the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/fakeorder/%Y-%m-%d/%H
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.batchSize=1000
a1.sinks.k1.hdfs.threadsPoolSize=4
a1.sinks.k1.hdfs.idleTimeout=0
a1.sinks.k1.hdfs.minBlockReplicas=1
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume02.conf -Dflume.root.logger=INFO,console
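One spooldir caveat worth noting: files must be complete and immutable by the time they land in the watched directory, otherwise the source can fail or pick up partial data. A minimal shell sketch of dropping a file in safely (SPOOL_DIR is a stand-in for the /root/data/flume directory above):

```shell
# Write the file elsewhere first, then move it into the watched directory
# in one atomic step so spooldir never sees a half-written file.
SPOOL_DIR="${SPOOL_DIR:-$(mktemp -d)}"   # substitute /root/data/flume
TMP=$(mktemp)
printf '1,tom,11,male\n2,amy,12,female\n' > "$TMP"
mv "$TMP" "$SPOOL_DIR/order_$(date +%s).log"
ls "$SPOOL_DIR"
```

mv within the same filesystem is atomic, which is why the write-then-rename pattern is safe here.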
2.1.3 taildir: Monitoring Directories and Files
Configuration file:
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# configure the source
a1.sources.s1.type=taildir
# multiple file groups can be listed, separated by spaces
a1.sources.s1.filegroups=f1 f2
# each group can match individual files by regex
a1.sources.s1.filegroups.f1=/root/data/flume/log01/.*\\.log
a1.sources.s1.filegroups.f2=/root/data/flume/log02/.*\\.log
a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.conf
# configure the channel
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=1000
# configure the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/tailevent/%Y-%m-%d/%H
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.batchSize=1000
a1.sinks.k1.hdfs.threadsPoolSize=4
a1.sinks.k1.hdfs.idleTimeout=0
a1.sinks.k1.hdfs.minBlockReplicas=1
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume04.conf -Dflume.root.logger=INFO,console
2.1.4 avro: Listening on an Avro Port
Configuration file:
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# configure the source
a1.sources.s1.type=avro
a1.sources.s1.bind=192.168.64.180
a1.sources.s1.port=7777
a1.sources.s1.threads=5
# configure the channel
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.c1.capacity=100000
a1.channels.c1.transactionCapacity=10000
# configure the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.64.180:9820/flume/events/avroevent/%Y-%m-%d/%H
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.filePrefix=log_%Y%m%d_%H
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.batchSize=100
a1.sinks.k1.hdfs.threadsPoolSize=4
a1.sinks.k1.hdfs.idleTimeout=0
a1.sinks.k1.hdfs.minBlockReplicas=1
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume03.conf -Dflume.root.logger=INFO,console
Then send a file with the avro client:
flume-ng avro-client -H 192.168.64.180 -p 7777 -c /opt/software/flume/flume190/conf/ -F /root/data/flume/prolog.log.bak
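In practice an avro source is usually fed not by the avro client but by another agent's avro sink, which is how multi-agent chains are built. A sketch of the matching sink on an upstream agent (the agent name a2 is a placeholder; hostname and port mirror the source above):

```properties
# upstream agent: forward events to the avro source at 192.168.64.180:7777
a2.sinks.k1.type=avro
a2.sinks.k1.hostname=192.168.64.180
a2.sinks.k1.port=7777
a2.sinks.k1.channel=c1
```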
2.2 Sink Components
2.2.1 hive sink: Writing Data to Hive
Requirements on the Hive table:
1. It must be a partitioned table
2. It must be bucketed
3. It must be stored as ORC
Preparation before running:
1. Check that the metastore service is listening
netstat -ntl | grep 9083
2. Enable Hive transaction support
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
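These SET statements only last for the current Hive session. To make transaction support permanent, the same properties can be put into hive-site.xml; a partial sketch:

```xml
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<!-- the remaining settings (hive.enforce.bucketing, hive.exec.dynamic.partition.mode,
     hive.compactor.initiator.on, hive.compactor.worker.threads) follow the same pattern -->
```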
3. Copy the Hive HCatalog jars that Flume depends on
cp /opt/software/hive/hive312/hcatalog/share/hcatalog/*.jar /opt/software/flume/flume190/lib/
4. Create the Hive table
create table familyinfo(
id int,
name string,
age int,
gender string
)
partitioned by (intime string)
clustered by(gender) into 2 buckets
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as orc
tblproperties('transactional'='true');
5. Manually add a partition for the current date and hour
alter table familyinfo add partition(intime= '21-07-05-16');
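The partition value must line up with the %y-%m-%d-%H pattern that the sink's hive.partition setting uses, so rather than typing it by hand the statement can be generated; a small shell sketch:

```shell
# Build the intime value in the same %y-%m-%d-%H layout used by hive.partition
PART=$(date +%y-%m-%d-%H)
echo "alter table familyinfo add partition(intime='${PART}');"
```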
Configuration file:
Note: running this agent kept failing with a checkpoint error, so the c1 name was dropped from the channel properties below. Flume silently ignores channel properties that are not addressed to a named channel (e.g. a1.channels.checkpointDir), so the file channel actually runs with its default settings, which is why the error no longer appears.
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#taildir source
a1.sources.s1.type=taildir
a1.sources.s1.filegroups=f1
a1.sources.s1.filegroups.f1=/root/data/flume/log01/.*.log
a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.json
a1.sources.s1.batchSize=10
#file channel
a1.channels.c1.type=file
a1.channels.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.capacity=100
a1.channels.transactionCapacity=10
#hive sink
a1.sinks.k1.type=hive
a1.sinks.k1.hive.metastore=thrift://192.168.64.180:9083
a1.sinks.k1.hive.database=test
a1.sinks.k1.hive.table=familyinfo
a1.sinks.k1.hive.partition=%y-%m-%d-%H
a1.sinks.k1.useLocalTimeStamp=true
a1.sinks.k1.autoCreatePartitions=false
a1.sinks.k1.round=true
a1.sinks.k1.batchSize=10
a1.sinks.k1.roundValue=10
a1.sinks.k1.roundUnit=minute
a1.sinks.k1.serializer=DELIMITED
a1.sinks.k1.serializer.delimiter=,
a1.sinks.k1.serializer.serdeSeparator=','
a1.sinks.k1.serializer.fieldnames=id,name,age,gender
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume05.cnf -Dflume.root.logger=INFO,console
2.2.2 hbase sink: Writing Data to HBase
Configuration file:
# component declarations
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#taildir source
a1.sources.s1.type=taildir
a1.sources.s1.filegroups=f1
a1.sources.s1.filegroups.f1=/root/data/flume/log2/.*.log
a1.sources.s1.positionFile=/root/data/flume/taildir/taildir_position.json
a1.sources.s1.batchSize=10
#file channel
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/software/flume/flume190/mydata/checkpoint
a1.channels.c1.dataDirs=/opt/software/flume/flume190/mydata/data
a1.channels.c1.capacity=100
a1.channels.c1.transactionCapacity=10
#hbase sink
a1.sinks.k1.type=hbase2
a1.sinks.k1.table = test:studentinfo
a1.sinks.k1.columnFamily = base
a1.sinks.k1.serializer.regex = (.*),(.*),(.*),(.*)
a1.sinks.k1.serializer = org.apache.flume.sink.hbase2.RegexHBase2EventSerializer
a1.sinks.k1.serializer.colNames = ROW_KEY,name,age,gender
a1.sinks.k1.serializer.rowKeyIndex = 0
a1.sinks.k1.batchSize = 10
# bind source and sink to the channel
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
Start the agent:
flume-ng agent -name a1 -c /opt/software/flume/flume190/conf/ -f /root/flume_job/conf/flume06.cnf -Dflume.root.logger=INFO,console
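The hbase2 sink assumes the namespace, table, and column family already exist; it will not create them. A sketch of preparing them in the HBase shell (names match the config above):

```
create_namespace 'test'
create 'test:studentinfo', 'base'
```

Each input line must then match the serializer regex, i.e. four comma-separated fields with the row key first, e.g. s001,tom,11,male.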