Collecting Logs into HDFS with Flume, and Setting Up the Hadoop Environment

Preface
 
Solution overview:
Flume uses a spooling directory (spoolDir) source to transfer files into HDFS.
Because one copy of each file is kept as a backup and another copy is parsed, two sinks are used, each with its own channel.
Flume's RegexExtractorExtInterceptor is a rewrite based on the original source code: it takes the file name as a header, splits the header value, and uses the parts to build the Hadoop directory path, so that collected files end up dispatched to their designated directories.

PS:
Package RegexExtractorExtInterceptor as a jar and drop it into Flume's lib folder.
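Building and deploying that jar is straightforward. The following is a minimal sketch, assuming the interceptor source file is RegexExtractorExtInterceptor.java (package org.apache.flume.interceptor) and Flume is installed under /opt/flume as in section II below:

# compile the custom interceptor against the jars shipped with Flume
mkdir -p build
javac -cp "/opt/flume/lib/*" -d build RegexExtractorExtInterceptor.java
# package it and drop it into Flume's lib folder
jar cf regex-extractor-ext-interceptor.jar -C build .
cp regex-extractor-ext-interceptor.jar /opt/flume/lib/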

In addition, a separate program is needed to collect the log files from their locations on each server into the spoolDir folder on the Flume server.
Any approach works; currently a Java program pulls them on a schedule, but a shell script that fetches them with scp would do just as well (see the sketch below).
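A minimal sketch of such a pull script, with hypothetical hosts, log paths and user (adjust to your environment). Files are copied under a temporary name and renamed afterwards so Flume never picks up a half-written file, and the final name keeps the filename_date.suffix pattern expected by the interceptor:

#!/bin/bash
SPOOL_DIR=/var/log/flume_spoolDir
TODAY=$(date +%F)
for host in 192.168.183.131 192.168.183.132 192.168.183.133; do
    # /data/applogs/... is a placeholder for the real log location on each server
    scp "root@${host}:/data/applogs/storelog_${TODAY}.log" \
        "${SPOOL_DIR}/${host}_storelog_${TODAY}.log.tmp" &&
    mv "${SPOOL_DIR}/${host}_storelog_${TODAY}.log.tmp" \
       "${SPOOL_DIR}/${host}_storelog_${TODAY}.log"
done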





 
I. Setting up a distributed Hadoop environment
 
1. JDK environment setup (omitted)

2. SSH public/private key setup
Reference: http://www.cnblogs.com/tankaixiong/p/4172942.html
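A minimal sketch of the key setup, assuming everything runs as root and the host names from the next step have already been added to /etc/hosts on the master:

# generate a key pair on the master (empty passphrase) and push the public key to every node
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
for host in hadoop130 hadoop131 hadoop132 hadoop133; do
    ssh-copy-id root@${host}
done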

3. Hosts configuration
vi /etc/hosts
192.168.183.130 hadoop130
192.168.183.131  hadoop131
192.168.183.132  hadoop132
192.168.183.133  hadoop133

4. Install Hadoop
tar -xvf hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 hadoop
 

5. Set the Hadoop environment variables
vi /etc/profile
export HADOOP_HOME=/opt/hadoop/
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
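After editing /etc/profile, reload it and confirm that the hadoop command is on the PATH:

source /etc/profile
hadoop version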
 
6. Edit the Hadoop configuration files (for Hadoop 2.x they live under etc/hadoop)
cd /opt/hadoop/etc/hadoop
vi core-site.xml


<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/tmp</value>
    <description>A base for other temporary directories. (Note: create the /usr/hadoop/tmp folder first.)</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.183.130:9000</value>
  </property>
</configuration>

Note: if the hadoop.tmp.dir parameter is not configured, the default temporary directory is /tmp/hadoop-hadoop, which is wiped on every reboot; you would then have to re-run the format each time or errors will occur.


vi hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/hdfs/data</value>
  </property>
</configuration>
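Before formatting the NameNode, create the local directories referenced in the two files above:

mkdir -p /usr/hadoop/tmp
mkdir -p /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data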


If YARN is needed:
vi yarn-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.183.130:9001</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.183.130</value>  <!-- IP of the master host -->
  </property>
</configuration>
(Note: mapred.job.tracker is a Hadoop 1.x property; on Hadoop 2.x MapReduce-on-YARN is configured through mapreduce.framework.name in mapred-site.xml instead.)
 
Note: be sure to enter the IP address above, not localhost, otherwise Eclipse will not be able to connect!

 
Set up the master/slave relationship in the $HADOOP_HOME/etc/hadoop/ directory:
vi masters
192.168.183.130


// only needed on the master; the slave machines do not need it
vi slaves

192.168.183.131
192.168.183.132
192.168.183.133
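The same Hadoop directory and configuration must also be present on every slave. A minimal sketch, assuming the identical /opt/hadoop layout and root access on each node:

for host in hadoop131 hadoop132 hadoop133; do
    scp -r /opt/hadoop root@${host}:/opt/
    scp /etc/profile root@${host}:/etc/profile
done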
 
hadoop namenode -format   // only needed the first time
 
Start the cluster:
sbin/start-all.sh
 
Check the status on the master:
 jps
2751 ResourceManager
2628 SecondaryNameNode
2469 NameNode

Check the status on the slaves:
jps
1745 NodeManager
1658 DataNode
 

 
There are 5 Hadoop daemons in total (3 on the master and 2 on each slave).
 
Visit the NameNode web address to check the running status of HDFS:
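On Hadoop 2.7.x the NameNode web UI listens on port 50070 by default, so with the configuration above the address would be http://192.168.183.130:50070. The same information is also available from the command line:

hdfs dfsadmin -report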
 
 
II. Flume installation and configuration
1. Install Flume
tar -xvf apache-flume-1.5.0-bin.tar.gz
mv apache-flume-1.5.0-bin flume

2. Configure environment variables
vim /etc/profile
export FLUME_HOME=/opt/flume
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$FLUME_HOME/bin

vim /opt/flume/conf/flume-env.sh
Set JAVA_HOME in this file to the JDK installation path.

3. Verify the installation
cd /opt/flume/bin/
./flume-ng version
If the version information is displayed, the installation succeeded.

4. Configure Flume according to the approach described in the preface
vim /opt/flume/conf/flume-conf.properties

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

a1.sources.r1.selector.type = replicating
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume_spoolDir
a1.sources.r1.deletePolicy=immediate
a1.sources.r1.basenameHeader=true
# ignore .tmp files that are still being copied, to avoid reading a file while it is being written
#a1.sources.r1.ignorePattern = ^(.)*\\.tmp$

a1.sources.r1.interceptors=i1  
a1.sources.r1.interceptors.i1.type=org.apache.flume.interceptor.RegexExtractorExtInterceptor$Builder  
a1.sources.r1.interceptors.i1.regex=(.*)_(.*)\\.(.*)  
a1.sources.r1.interceptors.i1.extractorHeader=true  
a1.sources.r1.interceptors.i1.extractorHeaderKey=basename  
a1.sources.r1.interceptors.i1.serializers=s1 s2 s3
# the basename must have the form filename_date.suffix, e.g. storelog_2015-03-16.log
a1.sources.r1.interceptors.i1.serializers.s1.name=filename  
a1.sources.r1.interceptors.i1.serializers.s2.name=date  
a1.sources.r1.interceptors.i1.serializers.s3.name=suffix 
a1.sources.r1.channels = c1 c2

# Describe the sink
a1.sinks.k1.type =hdfs
a1.sinks.k1.hdfs.path=hdfs://store.qbao.com:9000/storelog/bak/%{date}/%{filename}
a1.sinks.k1.hdfs.filePrefix=%{filename}_%{date}
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.rollInterval = 60
# File size to trigger roll, in bytes (0: never roll based on file size)
a1.sinks.k1.hdfs.rollSize = 128000000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.idleTimeout=60
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
#a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 200

a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/flume/checkpoint_c1
a1.channels.c1.dataDirs=/opt/flume/dataDir_c1


# Bind the source and sink to the channel
a1.sinks.k1.channel = c1




# Describe the sink
a1.sinks.k2.type =hdfs
a1.sinks.k2.hdfs.path=hdfs://store.qbao.com:9000/storelog/etl/%{filename}
a1.sinks.k2.hdfs.filePrefix=%{filename}_%{date}
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.rollInterval = 60
# File size to trigger roll, in bytes (0: never roll based on file size)
a1.sinks.k2.hdfs.rollSize = 128000000
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.batchSize = 100
a1.sinks.k2.hdfs.idleTimeout=60
a1.sinks.k2.hdfs.roundValue = 1
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
#a1.channels.c2.type = memory
#a1.channels.c2.capacity = 1000
#a1.channels.c2.transactionCapacity = 200

a1.channels.c2.type = file
a1.channels.c2.checkpointDir=/opt/flume/checkpoint_c2
a1.channels.c2.dataDirs=/opt/flume/dataDir_c2

a1.sinks.k2.channel=c2  

 

5. Start Flume
cd /opt/flume
nohup bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties &
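To verify the whole pipeline, drop a file whose name follows the filename_date.suffix pattern into the spooling directory and check that it appears under both sink paths; the file name and date below are only an example:

# the interceptor will extract filename=storelog, date=2015-03-16, suffix=log
echo "test line" > /var/log/flume_spoolDir/storelog_2015-03-16.log

# after the roll interval (60s) the data should show up under both sinks
hdfs dfs -ls /storelog/bak/2015-03-16/storelog
hdfs dfs -ls /storelog/etl/storelog

# Flume's console output was redirected by nohup; check it if nothing arrives
tail -n 50 /opt/flume/nohup.out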


 
 
PS: the required packages can be downloaded here:

http://pan.baidu.com/s/1hshgS4G