Using Flume to Monitor a Crawler's Output Folder and Upload It to HDFS

Original post, last modified 2023-01-07 00:53:25

License: CC 4.0 BY-SA

Tags:

#flume #crawler #hdfs

First published 2022-12-23 10:45:06

Part of the column "Hadoop Cluster Setup and Example Applications"

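This post shows how to use Flume to monitor the directory /export/nocv_data, send the data to several virtual machines with a load-balancing strategy, and finally upload it to HDFS. The setup uses a spooldir source, Avro and HDFS sinks, and memory channels. The Flume configuration files spell out the collection settings, the deletion policy, and the HDFS roll policy, with the aim of efficient and stable data transfer.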

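Once the crawler from the earlier article is written, it is run on the virtual machine with python3 ****.py, and it writes its output files to /export/nocv_data. The Flume source type is therefore spooldir, and the monitored path is /export/nocv_data. The setup also uses load balancing: we define a group of three machines in which one watches the folder and wraps the data into events, and the other two upload those events to HDFS. To keep collected data from piling up on the local disk, we set the deletePolicy parameter so that each file is deleted as soon as it has been collected. Writing to HDFS can also produce a large number of small files, each of which adds metadata, so we set the rollCount parameter as well.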
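
The spooling directory source expects every file in the watched folder to be complete and unchanging. According to the Flume documentation, a file that is written to after being placed there, or a file name that is reused later, makes the source log an error and stop processing. The following is a minimal sketch of how the crawler could hand over its output safely. The staging directory, the `write_rows_atomically` helper and the `nocv` file prefix are assumptions made for illustration and are not part of the original crawler.

```python
import csv
import os
import tempfile
import uuid


def write_rows_atomically(rows, spool_dir, staging_dir, prefix="nocv"):
    """Write rows to a CSV file in staging_dir, then move it into spool_dir.

    staging_dir should be on the same filesystem as spool_dir so that
    os.replace() is an atomic rename and Flume never sees a partial file.
    Both directory arguments and the prefix are hypothetical names.
    """
    os.makedirs(spool_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # Write the complete file outside the directory that Flume is watching.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=staging_dir)
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

    # A random suffix keeps file names unique, so no name is ever reused.
    final_path = os.path.join(spool_dir, f"{prefix}_{uuid.uuid4().hex}.csv")
    os.replace(tmp_path, final_path)
    return final_path
```

In the crawler this might be called as `write_rows_atomically(rows, "/export/nocv_data", "/export/nocv_staging")`, where `/export/nocv_staging` is a hypothetical sibling directory on the same disk.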

The collection pipeline has two tiers:

Tier 1 (hadoop1): spooldir-avro

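# spooldir-avro.conf: first-tier agent (hadoop1), spooling directory source -> Avro sinks

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Define the sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# The group's sink processor performs the load balancing
a1.sinkgroups.g1.processor.type = load_balance

# Load-balancing settings: back off failed sinks (capped at maxTimeOut milliseconds)
# and pick a sink at random
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
a1.sinkgroups.g1.processor.selector.maxTimeOut = 30000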

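# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/nocv_data
a1.sources.r1.fileHeader = true
# Pass the original file name to the second tier in the fileName header
a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = fileName

# The crawler writes CSV files
# Note: maxLineLength is an option of the default LINE deserializer; the
# BlobDeserializer chosen below is limited by deserializer.maxBlobLength instead
a1.sources.r1.deserializer.maxLineLength = 1048576
# Delete each file as soon as the source has finished reading it
# (this happens when the file is consumed, not after the upload to HDFS)
a1.sources.r1.deletePolicy = immediate
a1.sources.r1.consumeOrder = oldest
# Read a whole file as one event (typically one blob per file) rather than one event per line
a1.sources.r1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Attach the source to the channel
a1.sources.r1.channels = c1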

# Describe the sinks
# Destination 1 (sink): Avro to hadoop2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = hadoop2
a1.sinks.k1.port = 4545

# Destination 2 (sink): Avro to hadoop3
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = hadoop3
a1.sinks.k2.port = 4545


# Define the channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100


# Start the agent
#flume-ng agent --conf conf --conf-file conf/spooldir-avro.conf --name a1 -Dflume.root.logger=INFO,console
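
To make the sink group settings above more concrete, here is a simplified model of random sink selection with backoff. It is a sketch written for this post and not Flume's implementation: the `BackoffRandomSelector` class, the 1000 ms base penalty, and the reset on success are all assumptions. What it takes from the configuration is the idea that a failed sink is left out of selection for a period that grows with repeated failures and is capped by `maxTimeOut`.

```python
import random


class BackoffRandomSelector:
    """Simplified model of random sink selection with backoff (not Flume's code)."""

    def __init__(self, sinks, max_timeout_ms=30000, rng=None):
        self.sinks = list(sinks)
        self.max_timeout_ms = max_timeout_ms            # mirrors selector.maxTimeOut
        self.rng = rng or random.Random()
        self.failures = {s: 0 for s in self.sinks}      # consecutive failures per sink
        self.retry_at_ms = {s: 0 for s in self.sinks}   # earliest next attempt per sink

    def candidates(self, now_ms):
        """Return the sinks that are not currently backed off."""
        return [s for s in self.sinks if self.retry_at_ms[s] <= now_ms]

    def pick(self, now_ms):
        """Return a random available sink, or None if all are backed off."""
        available = self.candidates(now_ms)
        return self.rng.choice(available) if available else None

    def report_failure(self, sink, now_ms):
        """Back the sink off for an exponentially growing, capped period."""
        self.failures[sink] += 1
        # Assumed growth rule: 1000 ms doubled per consecutive failure.
        penalty = min(self.max_timeout_ms, 1000 * 2 ** self.failures[sink])
        self.retry_at_ms[sink] = now_ms + penalty

    def report_success(self, sink):
        """Assumption of this model: a success clears the sink's backoff state."""
        self.failures[sink] = 0
        self.retry_at_ms[sink] = 0
```

With sinks named `hadoop2` and `hadoop3`, a failure reported for `hadoop2` means every pick goes to `hadoop3` until the penalty for `hadoop2` has elapsed.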

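Tier 2 (hadoop2): avro-hdfs. Because sink k2 in the first tier points at hadoop3:4545, hadoop3 needs the same agent with its bind address changed accordingly.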


# Second-tier agent: Avro source -> HDFS sink

# Name the agent (a1) and its components: sources, sinks and channels

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source: listen for Avro events from the first tier

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop2
a1.sources.r1.port = 4545
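# Describe the sink: write events to HDFS
a1.sinks.k1.type = hdfs
# %{fileName} is replaced by the fileName header set in the first tier
a1.sinks.k1.hdfs.path = /nocv_data/%{fileName}
# Keep the original file name as the file prefix
a1.sinks.k1.hdfs.filePrefix = %{fileName}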

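# CSV output
# Note: DELIMITED, serializer.delimiter, serializer.serdeSeparator and serializer.fieldnames
# are options of the Hive sink's serializer. The HDFS sink reads its serializer from
# a1.sinks.k1.serializer (default TEXT), so the four lines below are most likely ignored
# and events are written with the default TEXT serializer.
a1.sinks.k1.hdfs.serializer=DELIMITED
a1.sinks.k1.hdfs.serializer.delimiter='\t'
a1.sinks.k1.hdfs.serializer.serdeSeparator='\t'
a1.sinks.k1.hdfs.serializer.fieldnames=group_name,pc_count,violation_pc_count,compliance_pc_count,quarantine_pc_count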


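# Whether to round the timestamp down (this affects all time-based escape sequences except %t)
a1.sinks.k1.hdfs.round = false
# Value to round down to (only used when hdfs.round = true)
a1.sinks.k1.hdfs.roundValue = 10
# Unit of the round-down value: second, minute or hour
a1.sinks.k1.hdfs.roundUnit = minute
# Use the local time (instead of the event's timestamp header) when replacing escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp = true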

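# Roll settings, to avoid a flood of small files (and their metadata) on HDFS
# Seconds to wait before rolling the temporary file into its final file (0 = never roll based on time)
a1.sinks.k1.hdfs.rollInterval = 10
# File size in bytes that triggers a roll (0 = never roll based on size)
a1.sinks.k1.hdfs.rollSize = 0
# Number of events written to the file that triggers a roll
a1.sinks.k1.hdfs.rollCount = 50
# Number of events written to the file before it is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# DataStream writes events uncompressed, so hdfs.codeC does not need to be set
a1.sinks.k1.hdfs.fileType = DataStream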

# Define the channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Start command
#flume-ng agent --conf conf --conf-file conf/spooldir-hdfs.conf --name a1 -Dflume.root.logger=INFO,console
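
The three roll settings in the second-tier configuration work as independent triggers, and a value of 0 switches a trigger off. The sketch below models only that decision and is not Flume's code: the `should_roll` function and its argument names are made up for illustration, and its defaults mirror the values used above (rollInterval = 10, rollSize = 0, rollCount = 50).

```python
def should_roll(age_seconds, bytes_written, events_written,
                roll_interval=10, roll_size=0, roll_count=50):
    """Return True when any enabled roll trigger has been reached.

    A threshold of 0 disables the corresponding trigger.
    """
    by_time = roll_interval > 0 and age_seconds >= roll_interval
    by_size = roll_size > 0 and bytes_written >= roll_size
    by_count = roll_count > 0 and events_written >= roll_count
    return by_time or by_size or by_count
```

For example, `should_roll(3, 10_000_000, 5)` is false with these defaults because size-based rolling is disabled, while `should_roll(10, 0, 1)` is true because the 10-second interval has passed. Since hdfs.path contains %{fileName} and the first tier usually sends one event per file, each HDFS file here is likely to be closed by the time trigger rather than by rollCount.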

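Start the tiers in order: the second tier first, then the first tier, so that the Avro sources are already listening when the first tier starts sending. Once Flume is running, start the crawler. Flume then watches the folder and uploads new files to HDFS as they appear.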