flume usage
1. Log collection
2. Data processing
3. What is flume
4. Flume deployment
5. event
6. Using flume
  1. Collecting data to logger (console)
    1. netcat
    2. exec
    3. spooldir
    4. taildir
  2. Writing files to HDFS (hdfs sink)
    1. Config file
    2. Dealing with small files
  3. Writing files to Hive
    1. Hive plain table
    2. Hive partitioned table
    3. hive sink
    4. Hive plain table + table with transactions enabled [ACID]
  4. File compression and file
  5. avro
1. Log collection
Data acquisition: capture the data onto a server
Data collection: move the data to a specified location
2. Data processing:
  1. Offline processing: batch processing
     the data is already sitting there before processing starts
  2. Real-time processing:
     each record is processed as soon as it is produced
3. flume
  1. Website: flume.apache.org
  2. Workflow:
     collecting  => source
     aggregating => channel
     moving      => sink
  3. streaming data flows: flume collects data as a continuous stream, in (near) real time
  4. Core concepts: a user job is simply the configuration written for an agent
     agent:
       source, channel, sink
       source:  collects the data
       channel: buffers the collected data
       sink:    sends the collected data onwards
4. Deployment
  1. Unpack the tarball
  2. Set the environment variables
  3. Configure flume
vim /home/hadoop/app/flume/lib/flume-env.sh
export JAVA_HOME=/home/hadoop/app/java
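A quick sanity check after the unpack / environment-variable steps (assuming $FLUME_HOME/bin has been added to the PATH):
flume-ng version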
5. event: one record of data
   headers: metadata describing the event
   body: the actual data
   For example (logger output): headers: null
                                body: 1
   Goal: the right data ends up in the right directory (headers can carry the routing information)
6. Using flume
  1. Collecting data to logger (console)
    1. netcat:
       collect from a specified port
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# netcat source type
a1.sources.r1.type = netcat
# bind to localhost
a1.sources.r1.bind = localhost
# port number
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
# sink type is logger (console)
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/nc-mem-logger.conf \
-Dflume.root.logger=info,console
Connect to the port and send data:
telnet localhost 44444
(nc -k -l <port>, a keep-open listener, is essentially what the netcat source does on the server side)
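A non-interactive way to push a test line (each line sent becomes one event; the netcat source should answer with OK):
echo "hello flume" | nc localhost 44444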
2. exec
   collect from a specified file
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# exec source type
a1.sources.r1.type = exec
# command that tails the file in real time
a1.sources.r1.command = tail -F /home/hadoop/emp/flume/1.log
a1.channels.c1.type = memory
# sink type is logger (console)
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/exec-mem-logger.conf \
-Dflume.root.logger=info,console
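To see events flow through, append lines to the tailed file while the agent is running (same file path as configured above):
echo "exec test $(date)" >> /home/hadoop/emp/flume/1.log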
Problems with the exec source:
  1. it relies on tail -F, which does not track a read position
  2. if flume goes down after collecting data, data may be re-ingested (or lost) when it comes back up, because no offset is recorded
3. spooldir
   collect the contents of a specified directory
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# spooldir source type
a1.sources.r1.type = spooldir
# directory to watch
a1.sources.r1.spoolDir = /home/hadoop/emp/flume/test/
a1.channels.c1.type = memory
# sink type is logger (console)
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/spooldir-mem-logger.conf \
-Dflume.root.logger=info,console
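The spooldir source ingests whole files dropped into the directory and renames each one with a .COMPLETED suffix once it has been processed; files must be fully written before being moved in, and a file name must not be reused. A quick test, assuming the directory above:
echo "spooldir test" > /tmp/a.log
mv /tmp/a.log /home/hadoop/emp/flume/test/
ls /home/hadoop/emp/flume/test/        # a.log becomes a.log.COMPLETED after ingestion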
4. taildir
   collect from specified directories and files (by file group)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# taildir source type
a1.sources.r1.type = TAILDIR
# file groups f1, f2, ... to collect
a1.sources.r1.filegroups = f1 f2
# path of file group f1
a1.sources.r1.filegroups.f1=/home/hadoop/emp/flume/1.log
# path pattern of file group f2
a1.sources.r1.filegroups.f2=/home/hadoop/emp/flume/test/.*.log
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/taildir-mem-logger.conf \
-Dflume.root.logger=info,console
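The taildir source records how far it has read in a JSON position file (by default ~/.flume/taildir_position.json), which is why it can resume after a restart without re-reading or losing data. A quick check, assuming the paths above:
echo "taildir test $(date)" >> /home/hadoop/emp/flume/1.log
cat ~/.flume/taildir_position.json     # shows inode, read offset and file path per tailed file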
2. Writing files to HDFS (hdfs sink)
  1. Config file
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/emp/flume/1.log
a1.channels.c1.type = memory
# sink type is hdfs
a1.sinks.k1.type = hdfs
# HDFS path
a1.sinks.k1.hdfs.path=hdfs://bigdata13:9000/flume/log/
# write the output as a plain data stream (without this the default SequenceFile output looks garbled)
a1.sinks.k1.hdfs.fileType=DataStream
# output serialization format
a1.sinks.k1.hdfs.writeFormat=Text
# file name prefix
a1.sinks.k1.hdfs.filePrefix=events
# file name suffix
a1.sinks.k1.hdfs.fileSuffix=.log
# use the local machine's timestamp (beware: the local clock may be wrong)
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# file rolling
# roll the file every 60 seconds
a1.sinks.k1.hdfs.rollInterval=60
# roll the file every 128 MB (134217728 bytes)
a1.sinks.k1.hdfs.rollSize=134217728
# roll the file every 1000 events
a1.sinks.k1.hdfs.rollCount=1000
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2. Dealing with small files
  1. hdfs.batchSize - not the answer here
  2. Path rounding (may help, but only when hdfs.path contains time escape sequences):
     hdfs.round      => whether to round down the event timestamp used in the path
     hdfs.roundUnit  => rounding unit: second, minute or hour
     hdfs.roundValue => round down to the nearest multiple of this value
  3. The useful ones - file rolling:
     hdfs.rollInterval => roll based on elapsed time (seconds)
     hdfs.rollSize     => roll based on file size (134,217,728 bytes => 128 MB)
     hdfs.rollCount    => roll based on the number of events written
4. Config file:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/emp/flume/1.log
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=hdfs://bigdata13:9000/flume/log/
a1.sinks.k1.hdfs.fileType=DataStream
# good to know
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.roundValue=1
# file rolling
a1.sinks.k1.hdfs.rollInterval=60
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollCount=10
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/taildir-mem-hdfs-round.conf \
-Dflume.root.logger=info,console
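Once a file rolls (60 s / 128 MB / 10 events, whichever happens first), it can be inspected on HDFS; with no filePrefix/fileSuffix set in this config the file names default to FlumeData.<timestamp> (with a .tmp suffix while still being written):
hdfs dfs -ls hdfs://bigdata13:9000/flume/log/
hdfs dfs -cat hdfs://bigdata13:9000/flume/log/FlumeData.*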
3. Writing files to Hive
  1. Hive plain (non-partitioned) table
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# local source file
a1.sources.r1.filegroups.f1=/home/hadoop/emp/1.txt
a1.channels.c1.type = memory
# sink type is hdfs
a1.sinks.k1.type = hdfs
# HDFS path of the Hive table
a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/user/hive/warehouse/bigdata_hive.db/emp
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.filePrefix=events
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.rollInterval=60
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollCount=100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/hive/taildir-mem--hdfs-emp.conf \
-Dflume.root.logger=info,console
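After a roll, the new file sits inside the table's warehouse directory and is immediately visible to queries on this plain text table. A quick check (paths and table names taken from the config above):
hdfs dfs -ls hdfs://bigdata13:9000/user/hive/warehouse/bigdata_hive.db/emp
hive -e "select * from bigdata_hive.emp limit 10;"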
2. Hive partitioned table
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/tmp/000000_0
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
# HDFS path of the Hive table plus the partition directory
a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/user/hive/warehouse/bigdata_hive.db/emp_p/deptno=10
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.filePrefix=events
a1.sinks.k1.hdfs.fileSuffix=.log
a1.sinks.k1.hdfs.rollInterval=60
a1.sinks.k1.hdfs.rollSize=134217728
a1.sinks.k1.hdfs.rollCount=100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/hive/taildir-mem-hdfs-emp_p.conf \
-Dflume.root.logger=info,console
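Flume only drops files into the deptno=10 directory; if that partition has not already been registered in the metastore, Hive will not see the data until the partition is added. Either command below works (table name taken from the config above):
hive -e "alter table bigdata_hive.emp_p add if not exists partition(deptno=10);"
hive -e "msck repair table bigdata_hive.emp_p;"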
3. hive sink
  1. emp.txt
  2. hive emp plain table
     source: taildir
     channel: memory
     sink: hive sink
  3. Config file
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/tmp/emp.txt
a1.channels.c1.type = memory
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore=        => the Hive metastore service must be running (its thrift URI goes here)
a1.sinks.k1.hive.database=bigdata_hive
a1.sinks.k1.hive.table=emp
a1.sinks.k1.serializer=DELIMITED    ==> specifies the field delimiter for the table (see the two lines below)
a1.sinks.k1.serializer.delimiter=','
a1.sinks.k1.serializer.fieldnames=empno,ename,job,mgr,hiredate,sal,comm,deptno
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
---------------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1=/home/hadoop/project/flume/hive/bucket_00000
a1.channels.c1.type = memory
a1.channels.c1.transactionCapacity=15000
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore= thrift://127.0.0.1:9083
a1.sinks.k1.hive.database=bigdata_hive
a1.sinks.k1.hive.table=emp
a1.sinks.k1.serializer=DELIMITED
a1.sinks.k1.serializer.delimiter=','
a1.sinks.k1.serializer.fieldnames=empno,ename,job,mgr,hiredate,sal,comm,deptno
a1.sinks.k1.batchSize=100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Errors encountered:
  1. set the channel transactionCapacity to 15000, or the sink batchSize to 100 (the sink default is 15000);
     make sure channel transactionCapacity >= sink batchSize
  2. add hive-hcatalog-streaming-3.1.3.jar to flume's lib directory
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/hive/taildir-mem-hive-emp.conf \
-Dflume.root.logger=info,console
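The hive sink streams into Hive transactions, so the target table must be a bucketed, ORC, transactional (ACID) table, and transactions must be enabled in hive-site.xml (hive.support.concurrency=true, hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager). A sketch of such a table definition; the column types, bucket column and bucket count are illustrative choices, not taken from these notes:
hive -e "
create table bigdata_hive.emp (
  empno int, ename string, job string, mgr int,
  hiredate string, sal double, comm double, deptno int
)
clustered by (empno) into 2 buckets
stored as orc
tblproperties('transactional'='true');
"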
4. Hive plain table + table with transactions enabled [ACID]
  1. Difference:
     1. source emp.txt => row-oriented storage
     2. table: Hive ACID, ORC => column-oriented storage
        Loading data: load the text into a plain staging table first, then
        insert into table <acid_table> select * from <staging_table>
  2. Sink recap:
     hdfs
     hive (ends up on HDFS as well)
     logger (console)
     avro => serialization, used to chain agents
  3. A two-tier (chained) flume setup is usually not needed.
  4. log => flume => hdfs
     for real-time computation: log => flume => kafka => real-time processing
4. File compression
5. avro: the sink of the first agent acts as the source of the second agent
   Requirement: read data from port 1111, forward it to port 2222, and finally write the data arriving on port 2222 to HDFS
   agents:
     nc-mem-avro     (listens on port 1111 and forwards the data to port 2222)
     avro-mem-hdfs   (reads from port 2222 and writes to HDFS; a sketch is given at the end of this section)
     avro-mem-logger (reads from port 2222 and prints to the console; used in the demo below)
   agent1: telnet localhost 1111  (the data producer)
   agent2: nc-mem-avro.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 1111
a1.channels.c1.type = memory
# sink type is avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=bigdata13
a1.sinks.k1.port=2222
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/avro/nc-mem-avro.conf \
-Dflume.root.logger=info,console
agent3: avro-mem-logger.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = bigdata13
a1.sources.r1.port = 2222
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
flume-ng agent \
--name a1 \
--conf ${FLUME_HOME}/conf \
--conf-file /home/hadoop/project/flume/avro/avro-mem-logger.conf \
-Dflume.root.logger=info,console
Startup order: agent3 -> agent2 -> agent1 (downstream agents first)
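The avro-mem-hdfs agent listed at the top of this section is not written out in these notes; a minimal sketch, reusing bigdata13:2222 as the avro source and an illustrative HDFS path /flume/avro/ (it would replace avro-mem-logger when the final target is HDFS):
cat > /home/hadoop/project/flume/avro/avro-mem-hdfs.conf <<'EOF'
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# avro source: receives events sent by the upstream avro sink
a1.sources.r1.type = avro
a1.sources.r1.bind = bigdata13
a1.sources.r1.port = 2222
a1.channels.c1.type = memory
# hdfs sink: same settings as in the hdfs-sink examples above
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/flume/avro/
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF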