Preface

Solution overview:
Flume uses a spoolDir source to ship files to HDFS.
Each file needs one copy kept as a backup and one copy parsed downstream, so the agent uses 2 sinks fed by 2 channels.
RegexExtractorExtInterceptor is a custom interceptor adapted from the Flume source code. It reads the file name from an event header, splits that value into parts, and uses the parts to build the target Hadoop directory, so collected files are fanned out into the right directories.
P.S.: package RegexExtractorExtInterceptor as a jar and drop it into Flume's lib folder.
You also need a program that collects the log files from each server into the spoolDir folder on the Flume host.
Any mechanism works; we currently use a Java program that pulls on a schedule, but a shell script fetching via scp would do as well.
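One possible shape for that collector, as a runnable sketch: the paths are scratch directories and the local `cp` stands in for the scp/rsync fetch from each app server, but the hand-off pattern is the important part. Files are copied under a `.tmp` name first and then renamed, because a rename on the same filesystem is atomic, so the spooling source never reads a half-written file (this pairs with the ignorePattern setting in the Flume config later).

```shell
SRC_DIR=$(mktemp -d)    # stands in for the logs fetched from a remote server
SPOOL_DIR=$(mktemp -d)  # stands in for /var/log/flume_spoolDir

printf 'line1\n' > "$SRC_DIR/storelog_2015-03-16.log"

for f in "$SRC_DIR"/*.log; do
    base=$(basename "$f")
    # Copy to a .tmp name first, then rename: the rename is atomic on one
    # filesystem, so the spoolDir source never sees a partial file.
    cp "$f" "$SPOOL_DIR/$base.tmp"
    mv "$SPOOL_DIR/$base.tmp" "$SPOOL_DIR/$base"
done

ls "$SPOOL_DIR"
```

In production, replace the `cp` with something like `scp "$host:$remote_file" "$STAGE_DIR/$base.tmp"`, staged on the same filesystem as the spoolDir so that the final `mv` stays atomic.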
I. Setting up a distributed Hadoop environment
1. JDK setup (omitted)
2. SSH public/private key setup (passwordless login)
Reference: http://www.cnblogs.com/tankaixiong/p/4172942.html
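The reference above covers the details; a minimal sketch of the same idea follows, assuming the hadoop13x hostnames from the hosts file in the next step. The key is written into a scratch directory so the sketch is safe to run; in real use drop `-f` and let ssh-keygen default to ~/.ssh/id_rsa.

```shell
# Generate an RSA key pair with an empty passphrase (scratch dir for the sketch).
keydir=$(mktemp -d)
ssh-keygen -t rsa -N '' -q -f "$keydir/id_rsa"
ls "$keydir"    # id_rsa  id_rsa.pub
# Then push the public key to each slave, entering the password one last time:
#   ssh-copy-id hadoop131
#   ssh-copy-id hadoop132
#   ssh-copy-id hadoop133
# Verify afterwards: ssh hadoop131 hostname   (should not prompt for a password)
```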
3. Host mapping
vi /etc/hosts
192.168.183.130 hadoop130
192.168.183.131 hadoop131
192.168.183.132 hadoop132
192.168.183.133 hadoop133
4. Install Hadoop
tar -xvf hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 hadoop
5. Set the Hadoop environment variables
vi /etc/profile
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
6. Edit the Hadoop configuration files
vi core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/tmp</value>
<description>A base for other temporary directories. (Create the tmp folder under /usr/hadoop first.)</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.183.130:9000</value>
</property>
</configuration>
(fs.defaultFS is the current name for the deprecated fs.default.name property.)
Note: if hadoop.tmp.dir is left unset, the default temporary directory is /tmp/hadoop-${user.name}. /tmp is wiped on every reboot, so you would have to re-run the namenode format after each restart or HDFS will fail to start.
vi hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop/hdfs/data</value>
</property>
</configuration>
If you need YARN:
vi yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>192.168.183.130</value><!-- master IP -->
</property>
</configuration>
(mapred.job.tracker is an MRv1 property and is ignored by YARN; on Hadoop 2.x, set mapreduce.framework.name to yarn in mapred-site.xml instead.)
Be sure to use the IP address here rather than localhost, otherwise remote clients such as Eclipse will not be able to connect.
Configure the master/slave membership in $HADOOP_HOME/etc/hadoop/:
vi masters
192.168.183.130
//only needed on the master; the slaves can skip this
vi slaves
192.168.183.131
192.168.183.132
192.168.183.133
hadoop namenode -format //only needed on first setup
Start the cluster:
sbin/start-all.sh
Check status on the master:
jps
2751 ResourceManager
2628 SecondaryNameNode
2469 NameNode
Check status on the slaves: jps should show a DataNode (plus a NodeManager once YARN is running).
II. Installing and configuring Flume
1. Install Flume
tar -xvf apache-flume-1.5.0-bin.tar.gz
mv apache-flume-1.5.0-bin flume
2. Configure environment variables
vim /etc/profile
export FLUME_HOME=/opt/flume
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$FLUME_HOME/bin
vim /opt/flume/conf/flume-env.sh
Point JAVA_HOME in this file at your JDK installation.
3. Verify the installation
cd /opt/flume/bin/
./flume-ng version
If the version information prints, the installation is working.
4. Configure Flume for the setup described in the preface
vim /opt/flume/conf/flume-conf.properties
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
a1.sources.r1.selector.type = replicating
a1.sinks.k2.channel=c2
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume_spoolDir
a1.sources.r1.deletePolicy=immediate
a1.sources.r1.basenameHeader=true
# Ignore .tmp files still being copied in, to avoid reading a file while it is being written
#a1.sources.r1.ignorePattern = ^(.)*\\.tmp$
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=org.apache.flume.interceptor.RegexExtractorExtInterceptor$Builder
a1.sources.r1.interceptors.i1.regex=(.*)_(.*)\\.(.*)
a1.sources.r1.interceptors.i1.extractorHeader=true
a1.sources.r1.interceptors.i1.extractorHeaderKey=basename
a1.sources.r1.interceptors.i1.serializers=s1 s2 s3
# The basename must follow filename_date.suffix, e.g. storelog_2015-03-16.log
a1.sources.r1.interceptors.i1.serializers.s1.name=filename
a1.sources.r1.interceptors.i1.serializers.s2.name=date
a1.sources.r1.interceptors.i1.serializers.s3.name=suffix
a1.sources.r1.channels = c1 c2
# Describe the sink
a1.sinks.k1.type =hdfs
a1.sinks.k1.hdfs.path=hdfs://store.qbao.com:9000/storelog/bak/%{date}/%{filename}
a1.sinks.k1.hdfs.filePrefix=%{filename}_%{date}
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.rollInterval = 60
# File size to trigger roll, in bytes (0: never roll based on file size)
a1.sinks.k1.hdfs.rollSize = 128000000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.idleTimeout=60
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
#a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 200
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/flume/checkpoint_c1
a1.channels.c1.dataDirs=/opt/flume/dataDir_c1
# Bind the source and sink to the channel
a1.sinks.k1.channel = c1
# Describe the sink
a1.sinks.k2.type =hdfs
a1.sinks.k2.hdfs.path=hdfs://store.qbao.com:9000/storelog/etl/%{filename}
a1.sinks.k2.hdfs.filePrefix=%{filename}_%{date}
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.rollInterval = 60
# File size to trigger roll, in bytes (0: never roll based on file size)
a1.sinks.k2.hdfs.rollSize = 128000000
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.batchSize = 100
a1.sinks.k2.hdfs.idleTimeout=60
a1.sinks.k2.hdfs.roundValue = 1
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
#a1.channels.c2.type = memory
#a1.channels.c2.capacity = 1000
#a1.channels.c2.transactionCapacity = 200
a1.channels.c2.type = file
a1.channels.c2.checkpointDir=/opt/flume/checkpoint_c2
a1.channels.c2.dataDirs=/opt/flume/dataDir_c2
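The interceptor's regex `(.*)_(.*)\.(.*)` depends on the spooled file's basename following the `filename_date.suffix` convention: the three captured groups become the `filename`, `date`, and `suffix` headers that the sinks substitute into their HDFS paths. The same split can be sketched in plain shell (the example basename comes from the config comment above):

```shell
# Split a spooled basename the way the interceptor's serializers do.
base="storelog_2015-03-16.log"
suffix=${base##*.}     # text after the last dot        -> log
stem=${base%.*}        # name without the suffix        -> storelog_2015-03-16
filename=${stem%_*}    # text before the last '_'       -> storelog
date=${stem##*_}       # text after the last '_'        -> 2015-03-16
echo "/storelog/bak/$date/$filename"
```

Note that both the regex's first group and `${stem%_*}` are greedy, so a basename with extra underscores (e.g. `store_log_2015-03-16.log`) splits at the last underscore before the date.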
5. Start Flume
cd /opt/flume
nohup bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties &
P.S. The required packages can be downloaded here:
http://pan.baidu.com/s/1hshgS4G