flume usage
1. Log collection
2. Data processing
3. What is flume
4. Flume deployment
5. event
6. Using flume
  1. Collecting data to logger (console)
    1. netcat
    2. exec
    3. spooldir
    4. taildir
  2. Writing files to HDFS (hdfs sink)
    1. Config file
    2. Dealing with small files
  3. Writing files to Hive
    1. Hive plain table
    2. Hive partitioned table
    3. hive sink
    4. Hive plain table + table with transactions enabled [ACID]
  4. File compression and file
  5. avro
1. Log collection
Data acquisition: capture the data onto a server
Data collection: move the data to a specified location
2. Data processing:
  1. Offline processing: batch processing
     the data is already sitting there before processing starts
  2. Real-time processing:
     each record is processed as soon as it is produced
3. flume
  1. Website: flume.apache.org
  2. Workflow:
     collecting  => source
     aggregating => channel
     moving      => sink
  3. streaming data flows: flume collects data as a continuous stream, in (near) real time
  4. Core concepts: a user job is simply the configuration written for an agent
     agent:
       source, channel, sink
       source:  collects the data
       channel: buffers the collected data
       sink:    sends the collected data onwards
4. Deployment
  1. Unpack the tarball
  2. Set the environment variables
  3. Configure flume
vim /home/hadoop/app/flume/lib/flume-env.sh
export JAVA_HOME=/home/hadoop/app/java
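A quick sanity check after the unpack / environment-variable steps (assuming $FLUME_HOME/bin has been added to the PATH):
flume-ng version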
5. event: one record of data
   headers: metadata describing the event
   body: the actual data
   For example (logger output): headers: null
                                body: 1
   Goal: the right data ends up in the right directory (headers can carry the routing information)
6. Using flume
  1. Collecting data to logger (console)
    1. netcat:
       collect from a specified port
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# netcat source type
a1.sources.r1.type = netcat
# bind to localhost
a1.sources.r1.bind = localhost
# port number
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
# sink type is logger (console)
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/nc-mem-logger.conf \
-Dflume.root.logger=info,console
Connect to the port and send data:
telnet localhost 44444
(nc -k -l <port>, a keep-open listener, is essentially what the netcat source does on the server side)
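A non-interactive way to push a test line (each line sent becomes one event; the netcat source should answer with OK):
echo "hello flume" | nc localhost 44444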
2. exec
   collect from a specified file
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# exec source type
a1.sources.r1.type = exec
# command that tails the file in real time
a1.sources.r1.command = tail -F /home/hadoop/emp/flume/1.log
a1.channels.c1.type = memory
# sink type is logger (console)
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/exec-mem-logger.conf \
-Dflume.root.logger=info,console
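To see events flow through, append lines to the tailed file while the agent is running (same file path as configured above):
echo "exec test $(date)" >> /home/hadoop/emp/flume/1.log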
Problems with the exec source:
  1. it relies on tail -F, which does not track a read position
  2. if flume goes down after collecting data, data may be re-ingested (or lost) when it comes back up, because no offset is recorded
3. spooldir
   collect the contents of a specified directory
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# spooldir source type
a1.sources.r1.type = spooldir
# directory to watch
a1.sources.r1.spoolDir = /home/hadoop/emp/flume/test/
a1.channels.c1.type = memory
# sink type is logger (console)
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/spooldir-mem-logger.conf \
-Dflume.root.logger=info,console
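The spooldir source ingests whole files dropped into the directory and renames each one with a .COMPLETED suffix once it has been processed; files must be fully written before being moved in, and a file name must not be reused. A quick test, assuming the directory above:
echo "spooldir test" > /tmp/a.log
mv /tmp/a.log /home/hadoop/emp/flume/test/
ls /home/hadoop/emp/flume/test/        # a.log becomes a.log.COMPLETED after ingestion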
4. taildir
   collect from specified directories and files (by file group)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# taildir source type
a1.sources.r1.type = TAILDIR
# file groups f1, f2, ... to collect
a1.sources.r1.filegroups = f1 f2
# path of file group f1
a1.sources.r1.filegroups.f1=/home/hadoop/emp/flume/1.log
# path pattern of file group f2
a1.sources.r1.filegroups.f2=/home/hadoop/emp/flume/test/.*.log
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/taildir-mem-logger.conf \
-Dflume.root.logger=info,console
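The taildir source records how far it has read in a JSON position file (by default ~/.flume/taildir_position.json), which is why it can resume after a restart without re-reading or losing data. A quick check, assuming the paths above:
echo "taildir test $(date)" >> /home/hadoop/emp/flume/1.log
cat ~/.flume/taildir_position.json     # shows inode, read offset and file path per tailed file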
2. Writing files to HDFS (hdfs sink)
  1. Config file
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/emp/flume/1.log
a1.channels.c1.type = memory
# sink type is hdfs
a1.sinks.k1.type = hdfs
# HDFS path
a1.sinks.k1.hdfs.path=hdfs://bigdata13:9000/flume/log/
# write the output as a plain data stream (without this the default SequenceFile output looks garbled)
a1.sinks.k1.hdfs.fileType=DataStream
# output serialization format
a1.sinks.k1.hdfs.writeFormat=Text
# file name prefix
a1.sinks.k1.hdfs.filePrefix=events
# file name suffix
a1.sinks.k1.hdfs.fileSuffix=.log
# use the local machine's timestamp (beware: the local clock may be wrong)
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# file rolling
# roll the file every 60 seconds
a1.sinks.k1.hdfs.rollInterval=60
# roll the file every 128 MB (134217728 bytes)
a1.sinks.k1.hdfs.rollSize=134217728
# roll the file every 1000 events
a1.sinks.k1.hdfs.rollCount=1000
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2. Dealing with small files
  1. hdfs.batchSize - not the answer here
  2. Path rounding (may help, but only when hdfs.path contains time escape sequences):
     hdfs.round      => whether to round down the event timestamp used in the path
     hdfs.roundUnit  => rounding unit: second, minute or hour
     hdfs.roundValue => round down to the nearest multiple of this value
  3. The useful ones - file rolling:
     hdfs.rollInterval => roll based on elapsed time (seconds)
     hdfs.rollSize     => roll based on file size (134,217,728 bytes => 128 MB)
     hdfs.rollCount    => roll based on the number of events written
4. Config file:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/emp/flume/1.log
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=hdfs://bigdata13:9000/flume/log/
a1.sinks.k1.hdfs.fileType=DataStream
# good to know
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.roundValue=1
# file rolling
a1.sinks.k1.hdfs.rollInterval=60
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollCount=10
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/taildir-mem-hdfs-round.conf \
-Dflume.root.logger=info,console
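Once a file rolls (60 s / 128 MB / 10 events, whichever happens first), it can be inspected on HDFS; with no filePrefix/fileSuffix set in this config the file names default to FlumeData.<timestamp> (with a .tmp suffix while still being written):
hdfs dfs -ls hdfs://bigdata13:9000/flume/log/
hdfs dfs -cat hdfs://bigdata13:9000/flume/log/FlumeData.*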
3. Writing files to Hive
  1. Hive plain (non-partitioned) table
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# local source file
a1.sources.r1.filegroups.f1=/home/hadoop/emp/1.txt
a1.channels.c1.type = memory
# sink type is hdfs
a1.sinks.k1.type = hdfs
# HDFS path of the Hive table
a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/user/hive/warehouse/bigdata_hive.db/emp
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.filePrefix=events
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.rollInterval=60
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollCount=100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/hive/taildir-mem--hdfs-emp.conf \
-Dflume.root.logger=info,console
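After a roll, the new file sits inside the table's warehouse directory and is immediately visible to queries on this plain text table. A quick check (paths and table names taken from the config above):
hdfs dfs -ls hdfs://bigdata13:9000/user/hive/warehouse/bigdata_hive.db/emp
hive -e "select * from bigdata_hive.emp limit 10;"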
2. Hive partitioned table
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/tmp/000000_0
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
# HDFS path of the Hive table plus the partition directory
a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/user/hive/warehouse/bigdata_hive.db/emp_p/deptno=10
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.filePrefix=events
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.rollInterval=60
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollCount=100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/hive/taildir-mem-hdfs-emp_p.conf \
-Dflume.root.logger=info,console
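Flume only drops files into the deptno=10 directory; if that partition has not already been registered in the metastore, Hive will not see the data until the partition is added. Either command below works (table name taken from the config above):
hive -e "alter table bigdata_hive.emp_p add if not exists partition(deptno=10);"
hive -e "msck repair table bigdata_hive.emp_p;"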
3. hive sink
  1. emp.txt
  2. hive emp plain table
     source: taildir
     channel: memory
     sink: hive sink
  3. Config file
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/tmp/emp.txt
a1.channels.c1.type = memory
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore=        => the Hive metastore service must be running (its thrift URI goes here)
a1.sinks.k1.hive.database=bigdata_hive
a1.sinks.k1.hive.table=emp
a1.sinks.k1.serializer=DELIMITED    ==> specifies the field delimiter for the table (see the two lines below)
a1.sinks.k1.serializer.delimiter=','
a1.sinks.k1.serializer.fieldnames=empno,ename,job,mgr,hiredate,sal,comm,deptno
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
---------------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/project/flume/hive/bucket_00000
a1.channels.c1.type = memory
a1.channels.c1.transactionCapacity=15000
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore= thrift://127.0.0.1:9083
a1.sinks.k1.hive.database=bigdata_hive
a1.sinks.k1.hive.table=emp
a1.sinks.k1.serializer=DELIMITED
a1.sinks.k1.serializer.delimiter=','
a1.sinks.k1.serializer.fieldnames=empno,ename,job,mgr,hiredate,sal,comm,deptno
a1.sinks.k1.batchSize=100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Errors encountered:
  1. set the channel transactionCapacity to 15000, or the sink batchSize to 100 (the sink default is 15000);
     make sure channel transactionCapacity >= sink batchSize
  2. add hive-hcatalog-streaming-3.1.3.jar to flume's lib directory
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/hive/taildir-mem-hive-emp.conf \
-Dflume.root.logger=info,console
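The hive sink streams into Hive transactions, so the target table must be a bucketed, ORC, transactional (ACID) table, and transactions must be enabled in hive-site.xml (hive.support.concurrency=true, hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager). A sketch of such a table definition; the column types, bucket column and bucket count are illustrative choices, not taken from these notes:
hive -e "
create table bigdata_hive.emp (
  empno int, ename string, job string, mgr int,
  hiredate string, sal double, comm double, deptno int
)
clustered by (empno) into 2 buckets
stored as orc
tblproperties('transactional'='true');
"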
4. Hive plain table + table with transactions enabled [ACID]
  1. Difference:
     1. source emp.txt => row-oriented storage
     2. table: Hive ACID, ORC => column-oriented storage
        Loading data: load the text into a plain staging table first, then
        insert into table <acid_table> select * from <staging_table>
  2. Sink recap:
     hdfs
     hive (ends up on HDFS as well)
     logger (console)
     avro => serialization, used to chain agents
  3. A two-tier (chained) flume setup is usually not needed.
  4. log => flume => hdfs
     for real-time computation: log => flume => kafka => real-time processing
4. File compression
5. avro: the sink of the first agent acts as the source of the second agent
   Requirement: read data from port 1111, forward it to port 2222, and finally write the data arriving on port 2222 to HDFS
   agents:
     nc-mem-avro     (listens on port 1111 and forwards the data to port 2222)
     avro-mem-hdfs   (reads from port 2222 and writes to HDFS; a sketch is given at the end of this section)
     avro-mem-logger (reads from port 2222 and prints to the console; used in the demo below)
   agent1: telnet localhost 1111  (the data producer)
   agent2: nc-mem-avro.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 1111
a1.channels.c1.type = memory
# sink type is avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=bigdata13
a1.sinks.k1.port=2222
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/avro/nc-mem-avro.conf \
-Dflume.root.logger=info,console
agent3: avro-mem-logger.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = bigdata13
a1.sources.r1.port = 2222
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/avro/avro-mem-logger.conf \
-Dflume.root.logger=info,console
Startup order: agent3 -> agent2 -> agent1 (downstream agents first)
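The avro-mem-hdfs agent listed at the top of this section is not written out in these notes; a minimal sketch, reusing bigdata13:2222 as the avro source and an illustrative HDFS path /flume/avro/ (it would replace avro-mem-logger when the final target is HDFS):
cat > /home/hadoop/project/flume/avro/avro-mem-hdfs.conf <<'EOF'
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# avro source: receives events sent by the upstream avro sink
a1.sources.r1.type = avro
a1.sources.r1.bind = bigdata13
a1.sources.r1.port = 2222
a1.channels.c1.type = memory
# hdfs sink: same settings as in the hdfs-sink examples above
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/flume/avro/
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF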