Getting logs into Hive with Flume

qqpy789

Tags: hadoop hive flume

Article link: https://blog.csdn.net/qqpy789/article/details/48470563

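This article describes how to use Apache Flume to collect logs on one machine (node1), ship them to another machine (master), store them in HDFS, and finally get them into Hive. On node1, an exec source reads the log, and a memory channel plus an avro sink send the data to master. On master, Flume receives the data with an avro source and writes it to HDFS through an hdfs sink. Because loading into Hive directly ran into problems, the workaround is to copy the files straight into the location that backs the Hive table. The detailed configuration and startup steps are given in the article.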

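Background: Flume is a log collection system from Apache. An agent is built from a source, a channel, and a sink. The source is where the logs come from; this example uses an exec source. The channel is the path the data travels along between the two, and it can be backed by a file or by memory. The sink is the output; common ones are the hive sink, hbase sink, hdfs sink, and avro sink. A single machine can of course run several source + channel + sink combinations.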
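To make the source + channel + sink wiring concrete, here is a minimal sketch of a single-agent configuration. The agent name `a1`, the component names `r1`, `c1`, `k1`, and the log path are hypothetical and are not taken from this article's setup; the logger sink is used only because it needs no other services.

```properties
# Hypothetical agent "a1": declare which components it owns.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# exec source: runs a command and turns each output line into an event.
# The path below is a placeholder, not a path from the article.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/example/app.log
a1.sources.r1.channels = c1

# memory channel: buffers events in RAM between the source and the sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# logger sink: writes events to Flume's own log, handy for a first test.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Note the asymmetry in the property names: a source can feed several channels (`channels`), while a sink drains exactly one (`channel`).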

Resources: 172.16.6.152 node1, with Flume and a DataNode installed

172.16.6.151 master, with Flume, Hive, and the NameNode installed

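Approach: 1. On node1, an exec source runs tail -F /*/*.log to pick up the logs, passes them through a memory channel, and sends them to master over avro.

2. On master, Flume's source is avro. It receives the data sent by node1, passes it through a memory channel, and finally writes it to HDFS with an hdfs sink.

3. My skills being limited, I originally wanted to use a hive sink directly on master, but it kept failing with an error saying a Hive class could not be found. So I fell back on a cruder trick: Hive can pick up data from files copied straight into the location that backs a Hive table, and that gives a workaround.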
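The avro hop in steps 1 and 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the article's actual configuration: the sink name `avro-sink1`, the agent name `agent2` and its component names, port 41414 on master, the NameNode port 8020, and the HDFS path are all hypothetical. Only `agent1`, `ch1`, and the master address 172.16.6.151 come from the article.

```properties
# --- node1 side: avro sink that forwards events to master ---
# (sink name is hypothetical; agent1 and ch1 follow the article)
agent1.sinks = avro-sink1
agent1.sinks.avro-sink1.type = avro
agent1.sinks.avro-sink1.hostname = 172.16.6.151
agent1.sinks.avro-sink1.port = 41414
agent1.sinks.avro-sink1.channel = ch1

# --- master side: hypothetical agent "agent2" ---
agent2.sources = avro-in
agent2.channels = ch1
agent2.sinks = hdfs-out

# avro source: listens for events sent by node1's avro sink.
agent2.sources.avro-in.type = avro
agent2.sources.avro-in.bind = 0.0.0.0
agent2.sources.avro-in.port = 41414
agent2.sources.avro-in.channels = ch1

agent2.channels.ch1.type = memory
agent2.channels.ch1.capacity = 100000
agent2.channels.ch1.transactionCapacity = 1000

# hdfs sink: writes plain-text files; the path is an assumed staging
# directory, not the Hive table location used by the author.
agent2.sinks.hdfs-out.type = hdfs
agent2.sinks.hdfs-out.channel = ch1
agent2.sinks.hdfs-out.hdfs.path = hdfs://master:8020/flume/staging
agent2.sinks.hdfs-out.hdfs.fileType = DataStream
agent2.sinks.hdfs-out.hdfs.writeFormat = Text
# Roll a new file every 300 seconds; disable size- and count-based rolling.
agent2.sinks.hdfs-out.hdfs.rollInterval = 300
agent2.sinks.hdfs-out.hdfs.rollSize = 0
agent2.sinks.hdfs-out.hdfs.rollCount = 0
```

Setting `hdfs.fileType = DataStream` with `hdfs.writeFormat = Text` keeps the output as plain text lines rather than SequenceFiles, which matters for step 3: files copied into a text-format Hive table's location need to be readable as text.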

Detailed setup:

1. Install Flume: unpack it, then edit the configuration file conf/flume-env.sh to set the environment variables.

2. On node1, create a file named example4.conf in the conf directory with the following content:

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 100000
agent1.channels.ch1.transactionCapacity = 100000
agent1.channels.ch1.keep-alive = 30

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
#agent1.sources.avro-source1.channels = ch1
#agent1.sources.avro-source1.type = avro
#agent1.sources.avro-source1.bind = 0.0.0.0
#agent1.sources.avro-source1.port = 41414
#agent1.sources.avro-source1.threads = 5

# Define an exec source that tails a log file (it reuses the name avro-source1)
agent1.sources.avro-source1.type = exec
agent1.sources.avro-source1.shell = /bin/bash -c
agent1.sources.avro-source1.command = tail -F /opt/cdh5.3.0/hadoop/logs/hadoop-hadoop-namenode-node1.lo