Flume Real-Time Data Collection and Transfer Examples

Latest recommended article published on 2023-04-26 19:02:14

笑面天下

Tags: flume

Article link: https://blog.csdn.net/m0_48379126/article/details/120817258

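Example 1: Monitor a port and print the incoming data to the console in real time

1. Component types used

① netcat source: listens on a TCP port for incoming data (for example, text typed by hand) and wraps each line in one event.
       It works much like nc -l <port>.

Configuration:
   Required properties: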
   type   –   The component type name, needs to be netcat
   bind   –   Host name or IP address to bind to
   port   –   Port # to bind to

② logger sink: writes events to a file or the console through the logger, recording each event at INFO level.
   Required properties:
   type   –   The component type name, needs to be logger
   Optional properties:
   maxBytesToLog   16   Maximum number of bytes of the Event body to log

③ memory channel
   Required properties:
   type   –   The component type name, needs to be memory
   Optional properties:
   capacity   100   The maximum number of events stored in the channel
   transactionCapacity   100   The maximum number of events the channel will take from a source or give to a sink per transaction

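2. Write the configuration file

# a1 is the agent name; a1 defines a source named r1. To define several, separate the names with spaces.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Format: <component name>.<property name>=<property value>
# Define the source
a1.sources.r1.type=netcat
a1.sources.r1.bind=hadoop102
a1.sources.r1.port=44444

# Define the sink
a1.sinks.k1.type=logger
a1.sinks.k1.maxBytesToLog=100

# Define the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

# Wire the components together: one source can feed several channels, but a sink can read from only one channel!
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1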

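3. Run the command

flume-ng agent --conf conf/ --name a1 --conf-file myagents/netcatsource-loggersink.conf -Dflume.root.logger=INFO,console

--conf: the directory that holds Flume's own configuration files

--name: the name of the agent, which must match the agent name used in the configuration file (a1 here)

--conf-file: the agent configuration file we wrote ourselves

-Dflume.root.logger=INFO,console: the log level to print (INFO) and where to print it (the console)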
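
To try Example 1, something has to write lines to the port the netcat source is bound to. The sketch below is one way to do that from Python instead of typing into nc or telnet. The helper name send_lines and its parameters are made up for illustration, and it assumes the agent is already listening on the host and port from the configuration above.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Send each string as one newline-terminated line over a single TCP connection.

    Returns the number of bytes sent. Each line should become one Flume event
    on the receiving netcat source.
    """
    payload = "".join(line + "\n" for line in lines).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
    return len(payload)

# Example call (host and port taken from a1.sources.r1.bind / a1.sources.r1.port):
# send_lines("hadoop102", 44444, ["hello", "flume"])
```

With the agent running, the example call should make the logger sink print two events on the agent's console. The netcat source normally acknowledges each line with a short reply; this sketch simply does not read it.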

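Example 2: Monitor a file and send the data to HDFS in real time (unreliable, not recommended)

1. Component types used

① Exec source
       Overview: when the agent starts, the exec source runs a Linux command. The command must start a process that keeps producing data!
                   Whatever the process writes to standard output is wrapped into events.
               Normally, if the command exits, the source exits as well and produces no more events.
               That is why this source is usually paired with commands such as cat or tail -f, rather than a command like date, which emits a single value and then exits.
               The source keeps no record of how far it has read, so data can be lost if the agent dies, which is why this approach is not recommended.
       Configuration:
           Required properties:
           type   –   The component type name, needs to be exec
           command   –   The command to execute

② HDFS sink
       Overview: the HDFS sink writes events to HDFS. It currently produces only two kinds of files, text and SequenceFile, and both can be compressed.
               Files written to HDFS can be rolled automatically (the file being written is closed and a new one is created), based on elapsed time, number of events, or data size.
               Data can be bucketed/partitioned by timestamp or by the machine that collected it.
               The HDFS directory path or file name may contain escape sequences; when an event is written, the sink replaces them to produce the actual path or file name.
               This sink requires Hadoop to be installed on the local machine, or at least the Hadoop jars to be available!
       Configuration:
           Required properties:
           type   –   The component type name, needs to be hdfs
           hdfs.path   –   HDFS directory path (eg hdfs://namenode/flume/webdata/)

           Optional properties: see the reference notes (Word document)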

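2. Configuration:

# a1 is the agent name; a1 defines a source named r1. To define several, separate the names with spaces.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Format: <component name>.<property name>=<property value>
# Define the source
a1.sources.r1.type=exec
a1.sources.r1.command=tail -f /opt/module/hive/logs/hive.log

# Define the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

# Define the sink
a1.sinks.k1.type = hdfs
# If the path contains time-based escape sequences, every event must carry a timestamp header; if it does not, set hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = hdfs://hadoop101:9000/flume/%Y%m%d/%H/%M
# Prefix for the uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-

# The next three settings control directory rolling: once the path contains time escape sequences, directories roll based on the timestamp
# Whether to round the timestamp down
a1.sinks.k1.hdfs.round = true
# Create a new directory every this many time units
a1.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = minute

# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100

# The next three settings control file rolling. They are OR-ed together, and a value of 0 disables the corresponding trigger!
# Roll to a new file every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll when a file reaches about 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Roll after this many events (0 = disabled)
a1.sinks.k1.hdfs.rollCount = 0
# Store the data as uncompressed text
a1.sinks.k1.hdfs.fileType=DataStream

# Wire the components together: one source can feed several channels, but a sink can read from only one channel!
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1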
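
It can be hard to picture which directory an event ends up in once hdfs.path contains escape sequences and rounding is switched on. The following Python sketch imitates that resolution for the %Y, %m, %d, %H and %M sequences used above. It is only a model of the behaviour, not Flume's code; the function names round_down and resolve_path are invented for this illustration.

```python
from datetime import datetime


def round_down(ts, value, unit):
    """Round a datetime down to a multiple of `value` units (second, minute or hour)."""
    if unit == "second":
        return ts.replace(second=ts.second - ts.second % value, microsecond=0)
    if unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % value, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % value, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


def resolve_path(template, ts, round_value=None, round_unit="second"):
    """Fill the time escape sequences of an hdfs.path-style template from a timestamp."""
    if round_value:
        ts = round_down(ts, round_value, round_unit)
    # %Y %m %d %H %M mean the same thing in strftime as in the path above.
    return ts.strftime(template)


template = "hdfs://hadoop101:9000/flume/%Y%m%d/%H/%M"
event_time = datetime(2021, 10, 17, 14, 37, 52)  # hypothetical event timestamp
per_minute = resolve_path(template, event_time, round_value=1, round_unit="minute")
per_ten_minutes = resolve_path(template, event_time, round_value=10, round_unit="minute")
```

With roundValue = 1 and roundUnit = minute every minute gets its own directory; raising roundValue to 10 in this model would group events into one directory per ten minutes.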

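3. Run the command

flume-ng agent --conf conf/ --name a1 --conf-file myagents/execsource-hdfssink.conf -Dflume.root.logger=INFO,console

--conf: the directory that holds Flume's own configuration files

--name: the name of the agent, which must match the agent name used in the configuration file (a1 here)

--conf-file: the agent configuration file we wrote ourselves

-Dflume.root.logger=INFO,console: the log level to print (INFO) and where to print it (the console)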
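
The three file-roll settings in the HDFS sink configuration (rollInterval, rollSize, rollCount) combine as a logical OR, and 0 switches a trigger off. A small Python model of that rule is shown below. The function should_roll is hypothetical and deliberately simplified: the real sink is driven by timers and write callbacks, which this sketch does not try to reproduce.

```python
def should_roll(age_seconds, size_bytes, event_count,
                roll_interval=30, roll_size=134217700, roll_count=0):
    """Return True if any enabled trigger fires; a value of 0 disables that trigger.

    The defaults mirror the values used in the configuration above.
    """
    by_time = roll_interval > 0 and age_seconds >= roll_interval
    by_size = roll_size > 0 and size_bytes >= roll_size
    by_count = roll_count > 0 and event_count >= roll_count
    return by_time or by_size or by_count
```

For example, should_roll(10, 1024, 5000) is False with these defaults because the event-count trigger is disabled, while should_roll(30, 0, 0) is True because the 30-second interval has elapsed.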

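Example 3: Monitor a directory and send the data to HDFS in real time

1. Component types used

1. Spooling Directory source (spooldir)
   Overview:
       The spooling directory source designates a directory on the local disk as the "spooling" (auto-collect) directory. The source reads
       files newly added to that directory and wraps their contents into events.

       Once a file has been read into the channel in full, the source either deletes it (whether deletion is allowed depends on the configuration) or
       renames it to mark it as completed, so that the source can keep watching for new files.

       Unlike the exec source, the spooling directory source is reliable: no data is lost even if Flume is killed or restarted! The price
       of this guarantee is that Flume reports an error and stops as soon as it detects either of the following:
               ① a file is modified after it has been placed in the directory, while it is being collected
               ② a file name is reused after the file was placed in the directory (a duplicate name appears)

       Requirement: only closed (finished) files may be placed in the spooling directory, and file names must never repeat within the same spooling directory!
   Usage:
       Required properties:
       type   –   The component type name, needs to be spooldir.
       spoolDir   –   The directory from which to read files from.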
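2. Configuration:

# a1 is the agent name; a1 defines a source named r1. To define several, separate the names with spaces.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Format: <component name>.<property name>=<property value>
# Define the source
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/root/flume

# Define the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

# Alternatively, use a file channel on local disk (slower, but safer than memory)
#a1.channels.c1.type=file
#a1.channels.c1.checkpointDir=/opt/flumeckp/checkpoint
#a1.channels.c1.dataDirs=/opt/flumeckp/data

# Define the sink
a1.sinks.k1.type = hdfs
# If the path contains time-based escape sequences, every event must carry a timestamp header; if it does not, set hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = hdfs://hadoop101:9000/flume/%Y%m%d/%H/%M
# Prefix for the uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-

# The next three settings control directory rolling: once the path contains time escape sequences, directories roll based on the timestamp
# Whether to round the timestamp down
a1.sinks.k1.hdfs.round = true
# Create a new directory every this many time units
a1.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = minute

# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100

# The next three settings control file rolling. They are OR-ed together, and a value of 0 disables the corresponding trigger!
# Roll to a new file every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll when a file reaches about 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Roll after this many events (0 = disabled)
a1.sinks.k1.hdfs.rollCount = 0
# Store the data as uncompressed text
a1.sinks.k1.hdfs.fileType=DataStream

# Wire the components together: one source can feed several channels, but a sink can read from only one channel!
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1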

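3. Run the command

flume-ng agent --conf conf/ --name a1 --conf-file myagents/spoolingsource-hdfssink.conf -Dflume.root.logger=INFO,console

--conf: the directory that holds Flume's own configuration files

--name: the name of the agent, which must match the agent name used in the configuration file (a1 here)

--conf-file: the agent configuration file we wrote ourselves

-Dflume.root.logger=INFO,console: the log level to print (INFO) and where to print it (the console)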

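Example 4: Monitor the files in a directory for appended data in real time and upload it to HDFS

1. Component types used

Taildir source (available since Flume NG 1.7!)
    Common problem: files collected by the Taildir source must not be renamed arbitrarily! If a log is named xxxx.tmp while it is being written and is
                   renamed to xxx.log when it rolls, and the matching rule matches both names, the data is collected twice!
   Overview:
       The Taildir source can read the latest appended content of multiple files.
       The Taildir source is reliable even if Flume fails or crashes. While it runs, it records the last read position of each file in a
       JSON file; when the agent restarts, it resumes the tail operation from the recorded position.

       The positions in the JSON file can be edited, and the Taildir source will then tail from the edited position. If the JSON file is lost, every
       file is read again from its first line, which produces duplicate data!

       The Taildir source can currently read text files only!

   Required properties: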
       channels   –
       type   –   The component type name, needs to be TAILDIR.
       filegroups   –   Space-separated list of file groups. Each file group indicates a set of files to be tailed.
       filegroups.<filegroupName>   –   Absolute path of the file group. Regular expression (and not file system patterns) can be used for filename only.
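
Since the position file is plain JSON, it can be inspected with a few lines of code before restarting an agent. The sketch below assumes the file is a JSON array of objects with "inode", "pos" and "file" fields; that layout is an assumption about the position file format rather than something stated in this article, and load_positions is a made-up helper name.

```python
import json


def load_positions(path):
    """Map each tailed file to the byte offset recorded for it.

    Assumes the position file is a JSON array of objects that each carry
    at least a "file" (absolute path) and a "pos" (byte offset) field.
    """
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return {entry["file"]: entry["pos"] for entry in entries}
```

A result such as {"/home/atguigu/hi": 12} would mean the source resumes reading that file at byte 12 after a restart.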

2. Configuration:

# a1 is the agent name; a1 defines a source named r1. To define several, separate the names with spaces.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Format: <component name>.<property name>=<property value>
# Define the source
a1.sources.r1.type=TAILDIR
a1.sources.r1.filegroups=f1 f2
a1.sources.r1.filegroups.f1=/home/atguigu/hi
a1.sources.r1.filegroups.f2=/home/atguigu/test

# Define the sink
a1.sinks.k1.type=logger
a1.sinks.k1.maxBytesToLog=100

# Define the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

# Wire the components together: one source can feed several channels, but a sink can read from only one channel!
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

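3. Run the command

flume-ng agent --conf conf/ --name a1 --conf-file myagents/taildirsource-loggersink.conf -Dflume.root.logger=INFO,console

--conf: the directory that holds Flume's own configuration files

--name: the name of the agent, which must match the agent name used in the configuration file (a1 here)

--conf-file: the agent configuration file we wrote ourselves

-Dflume.root.logger=INFO,console: the log level to print (INFO) and where to print it (the console)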

Example 5: Kafka channel

# a1 is the agent name; a1 defines a source named r1. To define several, separate the names with spaces.
a1.sources = r1
a1.channels = c1 c2

# Format: <component name>.<property name>=<property value>
a1.sources.r1.type=TAILDIR
a1.sources.r1.filegroups=f1
a1.sources.r1.batchSize=1000
# Read /tmp/logs/app-yyyy-mm-dd.log. ^ anchors the start of the file name, $ anchors the end, . matches any single character,
# and + means "one or more of the preceding element", so .+ matches any non-empty run of characters
a1.sources.r1.filegroups.f1=/tmp/logs/^app.+.log$
# Where the JSON position file is stored
a1.sources.r1.positionFile=/opt/module/flume/test/log_position.json

# Define the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.zyj.dw.flume.MyInterceptor$Builder

# Define the channel selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = topic
a1.sources.r1.selector.mapping.topic_start = c1
a1.sources.r1.selector.mapping.topic_event = c2

# Define the channels
a1.channels.c1.type=org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.channels.c1.kafka.topic=topic_start
a1.channels.c1.parseAsFlumeEvent=false

a1.channels.c2.type=org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.channels.c2.kafka.topic=topic_event
a1.channels.c2.parseAsFlumeEvent=false

# Wire the components together: one source can feed several channels
a1.sources.r1.channels=c1 c2
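
The file-name pattern in filegroups.f1 is a regular expression, and the dot before "log" is not escaped, so it matches any character rather than a literal dot. The Python sketch below checks a handful of made-up file names against the configured pattern and against a stricter variant. It assumes the source matches the regex against the whole file name; Python's re module is used only as a stand-in for Java's regex engine, which behaves the same for a pattern this simple.

```python
import re


def matching_files(file_names, regex):
    """Return the names that the regex matches in full (assumed to mirror how the file name is matched)."""
    pattern = re.compile(regex)
    return [name for name in file_names if pattern.fullmatch(name)]


# Hypothetical file names, for illustration only.
names = ["app-2021-10-17.log", "app-2021-10-17xlog", "app.log", "myapp-2021-10-17.log"]

loose = matching_files(names, r"^app.+.log$")    # pattern from the configuration
strict = matching_files(names, r"^app.+\.log$")  # dot before "log" escaped
```

Here loose contains both app-2021-10-17.log and app-2021-10-17xlog, while strict keeps only the first. Neither pattern matches app.log, because .+ needs at least one character between "app" and the final dot.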

Example 6: Kafka sink

a1.channels = c1
a1.sources = s1
a1.sinks = k1

a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/mydata

a1.channels.c1.type = memory
a1.channels.c1.capacity=1000

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mydemo03
a1.sinks.k1.kafka.bootstrap.servers = 192.168.1.101:9092

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Flume interceptor

# a1 is the agent name; a1 defines a source named r1. To define several, separate the names with spaces.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Format: <component name>.<property name>=<property value>
# Define the source
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/opt/soft/mydata/users
# regex_filter interceptor: with excludeEvents=true, events whose body matches the regex are dropped
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex=user.*
a1.sources.r1.interceptors.i1.excludeEvents=true

# Define the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

# Define the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic=users
a1.sinks.k1.kafka.bootstrap.servers=192.168.1.103:9092

# Wire the components together: one source can feed several channels, but a sink can read from only one channel!
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
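
To see what the regex_filter settings above do to the data, here is a small Python model of the filter. It is a sketch of the behaviour as I understand it (the regex is searched for anywhere in the event body), not Flume's implementation; the function name and the sample lines are invented.

```python
import re


def regex_filter(bodies, regex, exclude_events=False):
    """Keep or drop event bodies depending on whether the regex is found in them.

    exclude_events=False keeps only matching bodies;
    exclude_events=True drops matching bodies and keeps the rest.
    """
    pattern = re.compile(regex)
    kept = []
    for body in bodies:
        matched = pattern.search(body) is not None
        if matched != exclude_events:
            kept.append(body)
    return kept


# Hypothetical file content: a header line followed by one data line.
lines = ["user_id,locale", "3197468391,id_ID"]
without_header = regex_filter(lines, "user.*", exclude_events=True)
```

With regex=user.* and excludeEvents=true, any line containing "user" is discarded, so in this sample only the data line would reach the Kafka sink.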

Project example

Taildir source ---> Kafka channel

# a1 is the agent name; a1 defines a source named r1. To define several, separate the names with spaces.
a1.sources = r1
a1.channels = c1 c2

# Format: <component name>.<property name>=<property value>
a1.sources.r1.type=TAILDIR
a1.sources.r1.filegroups=f1
a1.sources.r1.batchSize=1000
# Read /tmp/logs/app-yyyy-mm-dd.log. ^ anchors the start of the file name, $ anchors the end, . matches any single character,
# and + means "one or more of the preceding element", so .+ matches any non-empty run of characters
a1.sources.r1.filegroups.f1=/tmp/logs/^app.+.log$
# Where the JSON position file is stored
a1.sources.r1.positionFile=/opt/module/flume/test/log_position.json

# Define the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.zyj.dw.flume.MyInterceptor$Builder

# Define the channel selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = topic
a1.sources.r1.selector.mapping.topic_start = c1
a1.sources.r1.selector.mapping.topic_event = c2

# Define the channels
a1.channels.c1.type=org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.channels.c1.kafka.topic=topic_start
a1.channels.c1.parseAsFlumeEvent=false

a1.channels.c2.type=org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.channels.c2.kafka.topic=topic_event
a1.channels.c2.parseAsFlumeEvent=false

# Wire the components together: one source can feed several channels, but a sink can read from only one channel!
a1.sources.r1.channels=c1 c2
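
The multiplexing selector in this configuration routes each event by the value of its topic header. The Python sketch below models that lookup. It assumes the custom interceptor (com.zyj.dw.flume.MyInterceptor, whose code is not shown in this article) sets a topic header of either topic_start or topic_event on every event; the function select_channels and its default handling are illustrative only.

```python
def select_channels(headers, header_name="topic", mapping=None, default=()):
    """Pick the channels for one event from the value of a routing header.

    `mapping` plays the role of the selector.mapping.* entries; events whose
    header is missing or unmapped go to `default` (empty here, since the
    configuration above defines no default channel).
    """
    if mapping is None:
        mapping = {"topic_start": ["c1"], "topic_event": ["c2"]}
    return list(mapping.get(headers.get(header_name), default))
```

In this model select_channels({"topic": "topic_start"}) gives ["c1"] and select_channels({"topic": "topic_event"}) gives ["c2"], which corresponds to startup logs landing in the topic_start Kafka topic and event logs in topic_event.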

Kafka source ---> HDFS sink

# Configuration file
a1.sources = r1 r2
a1.sinks = k1 k2
a1.channels = c1 c2

# Configure the sources
a1.sources.r1.type=org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_start
a1.sources.r1.kafka.consumer.auto.offset.reset=earliest
a1.sources.r1.kafka.consumer.group.id=CG_Start

a1.sources.r2.type=org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r2.kafka.topics=topic_event
a1.sources.r2.kafka.consumer.auto.offset.reset=earliest
a1.sources.r2.kafka.consumer.group.id=CG_Event
# Configure the channels
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/opt/module/flume/c1/checkpoint
# Enable the backup checkpoint
a1.channels.c1.useDualCheckpoints=true
a1.channels.c1.backupCheckpointDir=/opt/module/flume/c1/backupcheckpoint
# Directory where events are stored
a1.channels.c1.dataDirs=/opt/module/flume/c1/datas


a1.channels.c2.type=file
a1.channels.c2.checkpointDir=/opt/module/flume/c2/checkpoint
a1.channels.c2.useDualCheckpoints=true
a1.channels.c2.backupCheckpointDir=/opt/module/flume/c2/backupcheckpoint
a1.channels.c2.dataDirs=/opt/module/flume/c2/datas


# Configure the sinks
a1.sinks.k1.type = hdfs
# If the path contains time-based escape sequences, every event must carry a timestamp header; if it does not, set hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-

a1.sinks.k1.hdfs.batchSize = 1000

# File rolling
# Roll to a new file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
# Roll when a file reaches about 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Disable rolling based on the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Write the files with LZO (lzop) compression
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
#a1.sinks.k1.hdfs.round = true
#a1.sinks.k1.hdfs.roundValue = 10
#a1.sinks.k1.hdfs.roundUnit = second



a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://hadoop102:9000/origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-
a1.sinks.k2.hdfs.batchSize = 1000
a1.sinks.k2.hdfs.rollInterval = 60
a1.sinks.k2.hdfs.rollSize = 134217700
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.fileType = CompressedStream 
a1.sinks.k2.hdfs.codeC = lzop
#a1.sinks.k2.hdfs.round = true
#a1.sinks.k2.hdfs.roundValue = 10
#a1.sinks.k2.hdfs.roundUnit = second

# Wire the components together
a1.sources.r1.channels=c1
a1.sources.r2.channels=c2
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c2