Using Flume to extract image files and write them to HDFS

春日部动感超人

Category: hadoop

Article link: https://blog.csdn.net/u013430507/article/details/78674391

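Flume is a log-collection tool, and it is best at handling text data. In some scenarios, though, such as collecting a large number of small image files from a server, it can also be put to good use.
Without further ado, here is the flume-conf configuration: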

# ==== start ====
agent.sources = spooldirsource
agent.channels = memoryChannel
agent.sinks = hdfssink

# For each one of the sources, the type is defined
agent.sources.spooldirsource.type = spooldir

# The channel can be defined as follows.
agent.sources.spooldirsource.channels = memoryChannel

agent.sources.spooldirsource.spoolDir = /data/mcmin/imgfiles

agent.sources.spooldirsource.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

agent.sources.spooldirsource.deserializer.maxBlobLength = 100000000

# Each sink's type must be defined
agent.sinks.hdfssink.type = hdfs

# Specify the channel the sink should use
agent.sinks.hdfssink.channel = memoryChannel

# ns1 is the HDFS high-availability (HA) nameservice address
# /%Y/%m/%d creates the target directory dynamically from the date
agent.sinks.hdfssink.hdfs.path = hdfs://ns1/mcmin/%Y/%m/%d

agent.sinks.hdfssink.hdfs.useLocalTimeStamp = true

agent.sinks.hdfssink.hdfs.fileSuffix = .jpg

agent.sinks.hdfssink.hdfs.fileType = DataStream

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100000

# ==== end ====

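Two points need attention here:
1: The deserializer of spooldirsource must be declared as org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder, so that each file is read as one binary blob instead of being split into lines.
2: The fileType of the hdfs sink must be declared as DataStream, so that the raw bytes are written as-is rather than wrapped in the default SequenceFile format.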
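
Related to the second point: the HDFS sink rolls files by time, size, and event count, not per event, so whether each image ends up in its own HDFS file also depends on the roll settings. The fragment below is a minimal sketch and is not part of the original configuration. It reuses the agent, source, and sink names from the config above and assumes that one image per HDFS file, named after the source file, is the desired result. All values are illustrative.

```properties
# Assumption: at most one event (one image) per HDFS file.
agent.sinks.hdfssink.hdfs.rollCount = 1
# 0 disables size-based rolling.
agent.sinks.hdfssink.hdfs.rollSize = 0
# Keep a time-based roll so the last file written is eventually closed.
agent.sinks.hdfssink.hdfs.rollInterval = 30

# Optional: carry the original file name into the HDFS file name.
# basenameHeader makes the source add a "basename" header to each event,
# and %{basename} in filePrefix expands to that header's value.
agent.sources.spooldirsource.basenameHeader = true
agent.sinks.hdfssink.hdfs.filePrefix = %{basename}
```

With these settings the sink still appends its own counter and the configured fileSuffix to the prefix, so the HDFS name will not be identical to the source file name.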

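Here, Flume's spooldirsource wraps each image file in a single event, unlike text processing, where each line of the text becomes its own event. These events are also buffered in memory (the channel above is a memory channel), so processing a large number of image files at once, or large individual images, can easily exhaust memory. This is something to watch out for.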
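
To put a number on that risk: with capacity = 100000 events and maxBlobLength = 100000000 bytes, the configured upper bound is 100000 × 100000000 = 10^13 bytes (about 10 TB) of event bodies, far more than any realistic JVM heap. The following is a hedged sketch of two ways to bound it. It is not from the original configuration, the numbers are placeholders to be sized against the agent's heap, and the directories in variant B are hypothetical paths. Only one of the two variants would be used at a time.

```properties
# Variant A: keep the memory channel but cap what it may hold.
agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.transactionCapacity = 10
# Upper limit on the total bytes of event bodies held in the channel (1 GiB here).
agent.channels.memoryChannel.byteCapacity = 1073741824
# Source and sink batch sizes should not exceed transactionCapacity.
agent.sources.spooldirsource.batchSize = 10
agent.sinks.hdfssink.hdfs.batchSize = 10

# Variant B: buffer events on disk with a file channel instead.
# agent.channels.memoryChannel.type = file
# agent.channels.memoryChannel.checkpointDir = /data/flume/checkpoint
# agent.channels.memoryChannel.dataDirs = /data/flume/data
```

Variant A trades throughput for a predictable memory footprint. Variant B is slower but also survives an agent restart, which a memory channel does not.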
