Some commonly used Apache Flume configurations

Author: 我终于有blog了

Original article: https://blog.csdn.net/qq_29493353/article/details/80079347

1. HTTP source to HDFS sink (events are routed to different Hive tables depending on the incoming JSON; two methods). Note: a Hive table is just a directory on HDFS.

(1) HTTP source with a multiplexing channel selector:

            agent.sources.httpSource.type = http
            agent.sources.httpSource.port = 5140
            agent.sources.httpSource.handler = org.apache.flume.source.http.JSONHandler
            agent.sources.httpSource.bind = localhost
            agent.sources.httpSource.channels = memoryChannel memoryChannel1
            # Multiplexing selector: route each event by the value of one header.
            # Comments must sit on their own lines; in a properties file, text after a value becomes part of the value.
            agent.sources.httpSource.selector.type = multiplexing
            # Name of the event header to inspect, here table_name
            agent.sources.httpSource.selector.header = table_name
            # table_name = odl_log goes to channel 1
            agent.sources.httpSource.selector.mapping.odl_log = memoryChannel
            # table_name = odl_log1 goes to channel 2
            agent.sources.httpSource.selector.mapping.odl_log1 = memoryChannel1
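The mappings above only cover events whose table_name header is odl_log or odl_log1. The multiplexing selector also accepts a default channel for events that match no mapping. A minimal sketch, assuming unmatched events should go to memoryChannel (that choice is not in the original post):

```properties
# Assumption: unmatched events land in memoryChannel; pick whichever channel suits your setup.
agent.sources.httpSource.selector.default = memoryChannel
```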

Channels:

Create one channel per route, here memoryChannel and memoryChannel1.
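The post does not show the channel definitions. A minimal sketch of what they might look like, assuming memory channels; the capacity values are illustrative and should be tuned for the actual load:

```properties
# Sketch only: channel type and capacities are assumptions, not taken from the original post.
agent.channels = memoryChannel memoryChannel1

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 1000

agent.channels.memoryChannel1.type = memory
agent.channels.memoryChannel1.capacity = 10000
agent.channels.memoryChannel1.transactionCapacity = 1000
```

Memory channels are fast but lose buffered events if the agent stops; a file channel is the usual alternative when durability matters.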

HDFS sinks (one sink per hdfs.path):

            # Sink for the first table, fed by memoryChannel
            agent.sinks.kafka2hive_general.type = hdfs
            # Roll a file at 10 MB (10485760 bytes); 0 disables rolling by time and by event count
            agent.sinks.kafka2hive_general.hdfs.rollSize = 10485760
            agent.sinks.kafka2hive_general.hdfs.rollInterval = 0
            agent.sinks.kafka2hive_general.hdfs.rollCount = 0
            agent.sinks.kafka2hive_general.hdfs.path = /user/hive/warehouse/db.db/table
            agent.sinks.kafka2hive_general.channel = memoryChannel
            # Write plain text instead of a SequenceFile
            agent.sinks.kafka2hive_general.hdfs.fileType = DataStream
            agent.sinks.kafka2hive_general.hdfs.writeFormat = Text
            # Close a file after 600 seconds without writes
            agent.sinks.kafka2hive_general.hdfs.idleTimeout = 600

            # Sink for the second table, fed by memoryChannel1. It needs its own name:
            # repeating kafka2hive_general would overwrite the settings above.
            # The name kafka2hive_general1 is chosen here for illustration.
            agent.sinks.kafka2hive_general1.type = hdfs
            agent.sinks.kafka2hive_general1.hdfs.rollSize = 10485760
            agent.sinks.kafka2hive_general1.hdfs.rollInterval = 0
            agent.sinks.kafka2hive_general1.hdfs.rollCount = 0
            agent.sinks.kafka2hive_general1.hdfs.path = /user/hive/warehouse/db.db/table1
            agent.sinks.kafka2hive_general1.channel = memoryChannel1
            agent.sinks.kafka2hive_general1.hdfs.fileType = DataStream
            agent.sinks.kafka2hive_general1.hdfs.writeFormat = Text
            agent.sinks.kafka2hive_general1.hdfs.idleTimeout = 600

(2) Second method: put a table_name parameter in the headers of the event sent over HTTP, and read its value in the sink.

curl -X POST -d '[{ "headers" : {"timestamp" : "434324343", "host" :"random_host.example.com", "table_name" : "table" }, "body" : "random_body" }]' localhost:9000

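            a1.sinks.k1.type = hdfs
            # %{table_name} expands to the value of the event's table_name header
            a1.sinks.k1.hdfs.path = /user/hive/warehouse/db.db/%{table_name}
            a1.sinks.k1.hdfs.filePrefix = events-
            # Rounding only affects time escapes such as %H or %M in the path; this path has none
            a1.sinks.k1.hdfs.round = true
            a1.sinks.k1.hdfs.roundValue = 10
            a1.sinks.k1.hdfs.roundUnit = minute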
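The snippet above shows only the sink of agent a1. A minimal sketch of the rest of the agent that the curl command would talk to, assuming an HTTP source on port 9000 and a memory channel; the names r1 and c1 are hypothetical and do not appear in the original post:

```properties
# Hypothetical names: r1 (source) and c1 (channel) are not in the original post.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# HTTP source listening on the port used by the curl command
a1.sources.r1.type = http
a1.sources.r1.bind = localhost
a1.sources.r1.port = 9000
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

# Bind sink k1 to the channel
a1.sinks.k1.channel = c1
```

With this method a single sink serves every table, so adding a table needs no configuration change; the trade-off is that any table_name value sent by a client creates a directory.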

2. Kafka source to HDFS sink

            api_channel.sources = kafka2hive_general
            api_channel.channels = kafka2hive_general
            api_channel.sinks = kafka2hive_general
            # Kafka source. zookeeperConnect, topic and groupId are the Flume 1.6 property names.
            api_channel.sources.kafka2hive_general.type = org.apache.flume.source.kafka.KafkaSource
            api_channel.sources.kafka2hive_general.zookeeperConnect = *******
            api_channel.sources.kafka2hive_general.topic = ***
            api_channel.sources.kafka2hive_general.groupId = ****
            api_channel.sources.kafka2hive_general.channels = kafka2hive_general
            # Start from the oldest available offset when the group has none committed
            api_channel.sources.kafka2hive_general.kafka.consumer.auto.offset.reset = smallest
            # Copy the text before the first comma of each message into the table_name header
            api_channel.sources.kafka2hive_general.interceptors = i1
            api_channel.sources.kafka2hive_general.interceptors.i1.type = regex_extractor
            api_channel.sources.kafka2hive_general.interceptors.i1.regex = ^(\\w*),.*$
            api_channel.sources.kafka2hive_general.interceptors.i1.serializers = extract
            api_channel.sources.kafka2hive_general.interceptors.i1.serializers.extract.name = table_name
            # Memory channel
            api_channel.channels.kafka2hive_general.type = memory
            api_channel.channels.kafka2hive_general.capacity = 10000
            api_channel.channels.kafka2hive_general.transactionCapacity = 5000
            api_channel.channels.kafka2hive_general.keep-alive = 60
            # HDFS sink: one directory per table, partitioned by hour.
            # The time escapes (%Y, %m, %d, %H) need a timestamp header on each event.
            api_channel.sinks.kafka2hive_general.type = hdfs
            api_channel.sinks.kafka2hive_general.hdfs.rollSize = 10485760
            api_channel.sinks.kafka2hive_general.hdfs.rollInterval = 0
            api_channel.sinks.kafka2hive_general.hdfs.rollCount = 0
            api_channel.sinks.kafka2hive_general.hdfs.path = /user/hive/warehouse/a.db/%{table_name}/ds=%Y-%m-%d-%H/
            api_channel.sinks.kafka2hive_general.channel = kafka2hive_general
            api_channel.sinks.kafka2hive_general.hdfs.fileType = DataStream
            api_channel.sinks.kafka2hive_general.hdfs.writeFormat = Text
            api_channel.sinks.kafka2hive_general.hdfs.idleTimeout = 600
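The zookeeperConnect, topic and groupId properties above are the ones used by the Flume 1.6 Kafka source. From Flume 1.7 on, the source connects to the Kafka brokers directly and uses kafka.-prefixed property names. A rough equivalent of the source section might look like this; the values in angle brackets are placeholders, since the original post masks its real values:

```properties
# Sketch for Flume 1.7+; angle-bracket values are placeholders, not real settings.
api_channel.sources.kafka2hive_general.type = org.apache.flume.source.kafka.KafkaSource
# Kafka broker list, not the ZooKeeper address
api_channel.sources.kafka2hive_general.kafka.bootstrap.servers = <broker-host>:9092
api_channel.sources.kafka2hive_general.kafka.topics = <topic>
api_channel.sources.kafka2hive_general.kafka.consumer.group.id = <group-id>
# The newer Kafka consumer uses earliest/latest instead of smallest/largest
api_channel.sources.kafka2hive_general.kafka.consumer.auto.offset.reset = earliest
api_channel.sources.kafka2hive_general.channels = kafka2hive_general
```

The interceptor, channel and sink settings do not depend on these names and can stay as they are.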

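3. For other sources, such as the spooling directory source (watches a directory), the exec source running tail (follows a file), and further data sources, see http://flume.apache.org/FlumeUserGuide.html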
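As a quick orientation before reading the user guide, a spooling directory source and an exec source could be declared roughly as follows. The agent, source and channel names and both paths are hypothetical placeholders:

```properties
# Hypothetical agent, source and channel names; the paths are placeholders.
a1.sources = spoolSrc tailSrc
a1.channels = c1
a1.channels.c1.type = memory

# Spooling directory source: ingests files dropped into a directory
a1.sources.spoolSrc.type = spooldir
a1.sources.spoolSrc.spoolDir = /var/log/app/spool
a1.sources.spoolSrc.channels = c1

# Exec source: runs a command and turns each line of its output into an event
a1.sources.tailSrc.type = exec
a1.sources.tailSrc.command = tail -F /var/log/app/app.log
a1.sources.tailSrc.channels = c1
```

The user guide notes that the exec source gives no delivery guarantee if the agent or the command fails, which is one reason to prefer the spooling directory source when files can be handed over whole.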

4. To write custom sources and sinks, see http://flume.apache.org/FlumeDeveloperGuide.html and https://blog.csdn.net/yanshu2012/article/details/53391070
