Data Collection Tools
In a complete offline big-data processing system, besides the analytical core built from HDFS + MapReduce + Hive, indispensable auxiliary systems are also needed for data collection, result export, and job scheduling. The Hadoop ecosystem provides convenient open-source frameworks for each of these auxiliary tools, as shown below:
Figure: a typical large-scale offline data processing platform
Flume Log Collection Framework
1. Installing and deploying Flume
1. Installing Flume is very simple: just unpack the archive. The only prerequisite is an existing Hadoop environment.
Upload the installation package to the node where the data source resides,
then unpack it: tar -zxvf apache-flume-1.6.0-bin.tar.gz
then enter the flume directory and configure JAVA_HOME in conf/flume-env.sh.
2. Write a collection plan that matches the data-collection requirement, described in a configuration file (the file name is arbitrary).
3. Start a flume agent on the appropriate node, pointing it at that configuration file.
1. First create a configuration file (the collection plan) under flume's conf directory:
[ ] vi netcat-logger.conf (the file name is arbitrary, but it must match the -f argument used at startup)
# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe and configure the source component: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe and configure the sink component: k1
a1.sinks.k1.type = logger
# Describe and configure the channel component; here an in-memory channel is used
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2. Start the agent to begin collecting data:
[ ] bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c conf # directory containing Flume's own configuration files
-f conf/netcat-logger.conf # the collection plan we wrote
-n a1 # the name of this agent
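Once the agent is running, the netcat source can be exercised with `telnet localhost 44444` or with a few lines of Python. A minimal sketch, assuming the agent above is listening on localhost:44444 (the netcat source by default acknowledges each received line with "OK" plus a newline):

```python
import socket

def send_event(line, host="localhost", port=44444):
    """Send one newline-terminated line to a Flume netcat source.

    The netcat source turns each line into one Flume event and, with its
    default settings, acknowledges it; that acknowledgement is returned.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("utf-8") + b"\n")
        return sock.recv(16)
```

Each line sent this way should then appear in the agent's console output, since k1 is a logger sink running with -Dflume.root.logger=INFO,console.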
2. Configuration 1 (custom source)
#bin/flume-ng agent -n a1 -f /root/a1.conf -c conf -Dflume.root.logger=INFO,console
# Agent name and the names of its source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Define the source
a1.sources.r1.type = com.lic.flume.TailFileSource
a1.sources.r1.filePath =/root/flume_test/access.txt
a1.sources.r1.posiFile =/root/flume_test/posi.txt
a1.sources.r1.interval = 2000
a1.sources.r1.charset = UTF-8
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Define the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory =/root/flume_test/k1
# Assemble source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
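TailFileSource here is a user-written Java class, not a stock Flume source: it tails filePath, polling every interval milliseconds, and persists the current byte offset in posiFile so that nothing is re-read after an agent restart. A Python sketch of that offset bookkeeping (illustrative only, not the actual source code):

```python
import os

def tail_once(file_path, posi_path):
    """Read any lines appended since the saved offset, then persist the
    new offset -- one polling pass of the custom source's bookkeeping."""
    offset = 0
    if os.path.exists(posi_path):
        with open(posi_path) as f:
            offset = int(f.read().strip() or 0)
    with open(file_path, "rb") as f:
        f.seek(offset)           # resume where the last pass stopped
        data = f.read()
        offset = f.tell()        # remember how far we got
    with open(posi_path, "w") as f:
        f.write(str(offset))
    return [line.decode("utf-8") for line in data.splitlines()]
```

Calling tail_once in a loop with a 2-second sleep mirrors the interval = 2000 setting above.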
3. Configuration 2 (TailDirSource):
#bin/flume-ng agent -n a2 -f /home/hadoop/a2.conf -c conf -Dflume.root.logger=INFO,console
# Agent name and the names of its sources, channels, and sinks
a2.sources = r1 r2
a2.channels = c1 c2
a2.sinks = k1 k2
# Define the sources
a2.sources.r1.type = cn.edu360.flume.source.TailFileSource
a2.sources.r1.filePath = /Users/zx/Documents/logs/access.txt
a2.sources.r1.posiFile = /Users/zx/Documents/logs/pos.txt
a2.sources.r1.interval = 1000
a2.sources.r1.charset = UTF-8
a2.sources.r2.type = TAILDIR
# File that stores offsets; offsets for multiple files are kept in JSON format
a2.sources.r2.positionFile = /Users/zx/Desktop/position.json
a2.sources.r2.filegroups = g1
# Every file matching the regular expression is monitored
a2.sources.r2.filegroups.g1 = /Users/zx/Desktop/2017/.*.txt
a2.sources.r2.fileHeader = false
# Define the channels
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Define the sinks
a2.sinks.k1.type = file_roll
a2.sinks.k1.sink.directory = /Users/zx/Desktop/k1
a2.sinks.k2.type = file_roll
a2.sinks.k2.sink.directory = /Users/zx/Desktop/k2
# Assemble sources, channels, and sinks
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
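Note that the TAILDIR source type ships with Flume only from version 1.7.0 onward, so the 1.6.0 build installed earlier would need upgrading for r2 to work. Unlike the single-file custom source, TAILDIR tracks many files at once: its positionFile is a JSON array with one record per tracked file, each holding the file's inode, byte position, and path. A small sketch of reading that file, assuming the stock record layout:

```python
import json

def read_positions(position_file):
    """Map each tracked file's path to its saved byte offset, as stored
    in a TAILDIR positionFile (a JSON array of inode/pos/file records)."""
    with open(position_file) as f:
        return {rec["file"]: rec["pos"] for rec in json.load(f)}
```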
4. Configuration 3 (custom source combined with a Kafka channel):
#bin/flume-ng agent -n a0 -f /root/a0.conf -c conf -Dflume.root.logger=INFO,console
# Agent name and the names of its source and channel (a Kafka channel needs no sink)
a0.sources = r1
a0.channels = c1
# Define the source
a0.sources.r1.type = cn.edu360.flume.source.TailFileSource
a0.sources.r1.filePath = /Users/zx/Documents/logs/access.txt
a0.sources.r1.posiFile = /Users/zx/Documents/logs/posi.txt
a0.sources.r1.interval = 2000
a0.sources.r1.charset = UTF-8
a0.sources.r1.interceptors = i1
a0.sources.r1.interceptors.i1.type = cn.edu360.flume.interceptor.JsonInterceptor$Builder
a0.sources.r1.interceptors.i1.fields = id,name,fv,age
a0.sources.r1.interceptors.i1.separator = ,
a0.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a0.channels.c1.kafka.bootstrap.servers = node6:9092,node7:9092,node8:9092
a0.channels.c1.kafka.topic = usertest
a0.channels.c1.parseAsFlumeEvent = false
a0.sources.r1.channels = c1
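JsonInterceptor is another user-written class; judging from its fields and separator settings, it splits each line on the separator, pairs the pieces with the configured field names, and rewrites the event body as JSON before the event reaches the Kafka channel. A Python sketch of that mapping (illustrative; the real interceptor is Java):

```python
import json

def to_json(line, fields=("id", "name", "fv", "age"), separator=","):
    """Turn one delimited line into a JSON object keyed by field names,
    mirroring what the custom interceptor does to each event body."""
    return json.dumps(dict(zip(fields, line.split(separator))))
```

For example, to_json("1,tom,99,20") yields {"id": "1", "name": "tom", "fv": "99", "age": "20"}.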
Sqoop Data Migration Tool
Installing Sqoop
The prerequisite for installing Sqoop is a working Java and Hadoop environment.
1. Download and unpack
Download address
2. Edit the configuration file
[ ] cd /root/hadoop/sqoop-1.4.6/conf
[ ] mv sqoop-env-template.sh sqoop-env.sh
Open sqoop-env.sh and edit the following lines:
export HADOOP_COMMON_HOME=/root/hadoop/hadoop-2.8.4
export HADOOP_MAPRED_HOME=/root/hadoop/hadoop-2.8.4
export HBASE_HOME=/root/hadoop/hbase-1.2.1
export HIVE_HOME=/root/hadoop/hive-1.2.1
3. Copy the MySQL JDBC driver into Sqoop's lib directory
[ ] cp /root/hive/lib/mysql-connector-java-5.1.28.jar /root/hadoop/sqoop-1.4.6/lib/
4. Verify the installation
cd /root/hadoop/sqoop-1.4.6/bin
sqoop-version
Expected output:
15/12/17 14:52:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Sqoop 1.4.6 git commit id 5b34accaca7de251fc91161733f906af2eddbe83
Compiled by abe on Fri Aug 1 11:19:26 PDT 2015
This completes the Sqoop installation.
Verify connectivity between Sqoop and the MySQL business database:
[ ] bin/sqoop-list-databases --connect jdbc:mysql://localhost:3306 --username root --password root
[ ] bin/sqoop-list-tables --connect jdbc:mysql://localhost:3306/userdb --username root --password root
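With connectivity confirmed, a table can be imported into HDFS. A hedged example (the table name emp and the target directory are placeholders, not taken from this setup; the flags shown are standard Sqoop import options):

```
bin/sqoop import \
  --connect jdbc:mysql://localhost:3306/userdb \
  --username root --password root \
  --table emp \
  --target-dir /sqoop/import/emp \
  -m 1
```

-m 1 runs the import with a single map task, which avoids needing a --split-by column.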