flume day02 (case studies)

  • Contents

    1. Customizing the Hive directory name

    Steps

    2. Data latency

    3. Case: writing log data into a Hive table with Flume

    1. Regular table

    1. Create the data

    2. agent

    3. Start Flume

    2. Partitioned table

    1. Create the partitioned table

    2. Download the deptno=10 partition from HDFS

    3. Drop the deptno=10 partition in Hive

    4. agent

    5. Start Flume

    6. Fixing the missing results in Hive

    4. avro

    Case: read data from port 1111, forward it to port 2222, and have port 2222 write the data to HDFS

    1. Two agents

    2. Start Flume (three terminals)

    5. Compressed storage: bzip2

  • 1. Customizing the Hive directory name

    • Directory name before it is specified

      Directory name after it is specified

      event: one record of data
      	headers: metadata describing the record
      	body: the actual payload
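
      For intuition, this is roughly what the logger sink prints for a single event; the header value below is made up for illustration (a timestamp header only exists when useLocalTimeStamp or an interceptor has set one):

      Event: { headers:{timestamp=1670985423512} body: 64 6C 32 32 36 32 2C 31    dl2262,1 }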
    • Steps

      • 1. Add data (source.log)
            Path: /home/hadoop/tmp
        for x in {1..100}
        do 
          echo "dl2262,${x}" >> /home/hadoop/tmp/source.log
          sleep 0.1s
        done  
      • 2. agent
         Configure the taildir-mem-hdfs-round2.conf file
         Path: /home/hadoop/project/flume
         [hadoop@bigdata13 flume]$ vim taildir-mem-hdfs-round2.conf
        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1
        
        a1.sources.r1.type = TAILDIR
        a1.sources.r1.filegroups = f1
        a1.sources.r1.filegroups.f1=/home/hadoop/tmp/source.log
        
        a1.channels.c1.type = memory
        
        a1.sinks.k1.type = hdfs
        a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/flume/log/%Y-%m-%d/
        a1.sinks.k1.hdfs.fileType=DataStream
        a1.sinks.k1.hdfs.writeFormat=Text
        # file name prefix/suffix
        a1.sinks.k1.hdfs.filePrefix=events
        a1.sinks.k1.hdfs.fileSuffix=.log
        a1.sinks.k1.hdfs.useLocalTimeStamp=true
        
        # file roll policy
        a1.sinks.k1.hdfs.rollInterval=60
        a1.sinks.k1.hdfs.rollSize=134217728
        a1.sinks.k1.hdfs.rollCount=500
        
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1

           a1.sinks.k1.hdfs.useLocalTimeStamp=true
           specifies that events are written according to the local machine's clock rather than the time carried in the data itself (the HDFS sink needs a timestamp to resolve the %Y-%m-%d escapes in its path)

      • 3. Start Flume
           Path: /home/hadoop/app/flume/conf
        flume-ng agent \
        --name a1 \
        --conf ${FLUME_HOME}/conf \
        --conf-file /home/hadoop/project/flume/taildir-mem-hdfs-round2.conf \
        -Dflume.root.logger=info,console
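
        Once the agent has run for a minute, you can check that the %Y-%m-%d escape produced a dated directory (generic verification commands; the exact file name will differ on your machine):

        hadoop fs -ls /flume/log/$(date +%Y-%m-%d)/
        hadoop fs -cat /flume/log/$(date +%Y-%m-%d)/events.*.log | head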
        

         

  • 2. Data latency

    • Definition: data produced earlier arrives later, and data produced later arrives earlier
    • Solutions
      • log-collected data => hive
        • 1. UDF: re-land the misplaced records into the correct partition [data cleansing]
        • 2. Fix it at the source in Flume (log => flume => hive => hdfs)
          • where an event lands is decided by the local machine's clock, not the time carried by the data itself
          • event => hdfs partition directory
            • 1. useLocalTimeStamp cannot solve latency: a late event still lands in the partition of its arrival time
            • 2. put the event's own time into the header, so that each record lands in the correct partition
                  either by extending Flume yourself, or with a built-in interceptor (see the sketch below)
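
    • A concrete form of option 2 that avoids modifying Flume: the built-in regex_extractor interceptor can pull the event's own time out of the body and store it in the timestamp header, which the HDFS sink's %Y-%m-%d escapes then use instead of the local clock. A minimal sketch, assuming each log line starts with a yyyy-MM-dd HH:mm:ss timestamp (the regex and line layout are assumptions, not from the original case):

      # extract the leading timestamp from the body into the "timestamp" header
      a1.sources.r1.interceptors = i1
      a1.sources.r1.interceptors.i1.type = regex_extractor
      a1.sources.r1.interceptors.i1.regex = ^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})
      a1.sources.r1.interceptors.i1.serializers = s1
      a1.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
      a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp
      a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm:ss

      # with the timestamp header set by the interceptor,
      # leave useLocalTimeStamp off so the sink uses event time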
  • 3. Case: writing log data into a Hive table with Flume

    • log data => flume => hive table
          emp.txt
          agent:
              source : taildir
              channel : mem
              sink:
                HDFS Sink
                Hive Sink => not used here; it has been tested and is full of pitfalls
    • 1. Regular table

      • 1. Create the data

        • 1. Path: /home/hadoop/tmp/emp.txt
              Because the HDFS sink drops this file straight into the table's warehouse directory, emp.txt holds comma-delimited rows matching the table definition below (null columns left empty):
          7369,SMITH,CLERK,7902,1980-12-17,800,,20
          7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
          7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
          7566,JONES,MANAGER,7839,1981-04-02,2975,,20
          7654,MARTIN,SALESMAN,7698,1981-09-28,1250,1400,30
          7698,BLAKE,MANAGER,7839,1981-05-01,2850,,30
          7782,CLARK,MANAGER,7839,1981-06-09,2450,,10
          7788,SCOTT,ANALYST,7566,1982-12-09,3000,,20
          7839,KING,PRESIDENT,,1981-11-17,5000,,10
          7844,TURNER,SALESMAN,7698,1981-09-08,1500,0,30
          7876,ADAMS,CLERK,7788,1983-01-12,1100,,20
          7900,JAMES,CLERK,7698,1981-12-03,950,,30
          7902,FORD,ANALYST,7566,1981-12-03,3000,,20
          7934,MILLER,CLERK,7782,1982-01-23,1300,,10
          
        • 2. The table in Hive
          CREATE TABLE emp (
            empno decimal(4,0) ,
            ename string ,
            job string ,
            mgr decimal(4,0) ,
            hiredate string ,
            sal decimal(7,2) ,
            comm decimal(7,2) ,
            deptno decimal(2,0) 
          ) 
          row format  delimited fields terminated by ','
          stored as textfile;

             the table is empty, no content yet
             what the HDFS warehouse directory shows

      • 2. agent

      •    Configure the taildir-mem-hdfs-emp.conf file
           Path: /home/hadoop/project/flume/hive
           [hadoop@bigdata13 hive]$ vim taildir-mem-hdfs-emp.conf

        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1
        
        a1.sources.r1.type = TAILDIR
        a1.sources.r1.filegroups = f1
        a1.sources.r1.filegroups.f1=/home/hadoop/tmp/emp.txt
        
        a1.channels.c1.type = memory
        
        a1.sinks.k1.type = hdfs
        a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/user/hive/warehouse/bigdata_flume.db/emp
        a1.sinks.k1.hdfs.fileType=DataStream
        a1.sinks.k1.hdfs.writeFormat=Text
        # file name prefix/suffix
        a1.sinks.k1.hdfs.filePrefix=events
        a1.sinks.k1.hdfs.fileSuffix=.log
        
        # file roll policy
        a1.sinks.k1.hdfs.rollInterval=60
        a1.sinks.k1.hdfs.rollSize=134217728
        a1.sinks.k1.hdfs.rollCount=100
        
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
      • 3. Start Flume

      • Path: /home/hadoop/app/flume/conf
        flume-ng agent \
        --name a1 \
        --conf ${FLUME_HOME}/conf \
        --conf-file /home/hadoop/project/flume/hive/taildir-mem-hdfs-emp.conf \
        -Dflume.root.logger=info,console

          the file appears on HDFS

          querying the table in Hive now returns the rows
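
          A quick way to confirm both sides (generic commands; the flume-generated file name will differ):

          hadoop fs -ls /user/hive/warehouse/bigdata_flume.db/emp
          hive -e 'select count(*) from bigdata_flume.emp;'   # expect 14 rows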

    • 2. Partitioned table

      • 1. Create the partitioned table

        • 1. Create the table
          CREATE  TABLE emp_p (
            empno decimal(4,0) ,
            ename string ,
            job string ,
            mgr decimal(4,0) ,
            hiredate string ,
            sal decimal(7,2) ,
            comm decimal(7,2)
          ) 
          PARTITIONED BY (deptno decimal(2,0))
          row format  delimited fields terminated by ','
          stored as textfile;
        • 2. Insert data
          set hive.exec.dynamic.partition.mode=nonstrict;
          insert overwrite table emp_p partition(deptno)
          select
          empno,
          ename,
          job  ,
          mgr  ,
          hiredate,
          sal  ,
          comm ,
          deptno
          from emp;
        • 3. View the partitions
          hive (bigdata_flume)> show partitions emp_p;


          what the HDFS directory shows

      • 2. Download the deptno=10 partition from HDFS to local

      • [hadoop@bigdata13 tmp]$ hadoop fs -get /user/hive/warehouse/bigdata_flume.db/emp_p/deptno=10/*
           (with no local destination given, the partition file lands in the current directory as 000000_0, which the agent below tails)

      • 3. Drop the deptno=10 partition in Hive

      • hive (bigdata_flume)> alter table emp_p drop partition(deptno=10);
           check the partitions: show partitions emp_p;

         

      • 4. agent

      •    Configure the taildir-mem-hdfs-emp_p.conf file
           Path: /home/hadoop/project/flume/hive
           [hadoop@bigdata13 hive]$ vim taildir-mem-hdfs-emp_p.conf

        a1.sources = r1
        a1.sinks = k1
        a1.channels = c1
        
        a1.sources.r1.type = TAILDIR
        a1.sources.r1.filegroups = f1
        a1.sources.r1.filegroups.f1=/home/hadoop/tmp/000000_0
        
        a1.channels.c1.type = memory
        
        a1.sinks.k1.type = hdfs
        a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/user/hive/warehouse/bigdata_flume.db/emp_p/deptno=10
        a1.sinks.k1.hdfs.fileType=DataStream
        a1.sinks.k1.hdfs.writeFormat=Text
        # file name prefix/suffix
        a1.sinks.k1.hdfs.filePrefix=events
        a1.sinks.k1.hdfs.fileSuffix=.log
        
        # file roll policy
        a1.sinks.k1.hdfs.rollInterval=60
        a1.sinks.k1.hdfs.rollSize=134217728
        a1.sinks.k1.hdfs.rollCount=100
        
        a1.sources.r1.channels = c1
        a1.sinks.k1.channel = c1
      • 5. Start Flume

      •    Path: /home/hadoop/app/flume/conf

        flume-ng agent \
        --name a1 \
        --conf ${FLUME_HOME}/conf \
        --conf-file /home/hadoop/project/flume/hive/taildir-mem-hdfs-emp_p.conf \
        -Dflume.root.logger=info,console

          after starting, the data shows up on HDFS, but queries in Hive return no results

         

      • 6. Fixing the missing results in Hive

      •    Add the partition manually: hive (bigdata_flume)> alter table emp_p add partition(deptno=10);
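
           This works because the files are already on HDFS; what is missing is the metastore entry for deptno=10 that the earlier drop removed, so Hive never scans that directory. As an alternative to adding partitions one by one, standard HiveQL can rediscover every partition directory in one pass (not from the original notes):

           -- scan the table's HDFS location and register any missing partitions
           msck repair table emp_p;
           show partitions emp_p;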

  • 4. avro

    • Definition: the sink of the first agent feeds the source of the second agent

    • sinks:

      • hdfs

      • hive => hdfs

      • logger

      • avro => a serialization format

    • Case: read data from port 1111, forward it to port 2222, and have port 2222 write the data to HDFS

      • Requirement: read data from port 1111, forward it to port 2222, and have port 2222 write the data to HDFS
                    agents:
                          nc-mem-avro
                          avro-mem-hdfs
                          avro-mem-logger

      • 1. Two agents

        • agent1:
           Configure the nc-mem-avro.conf file
           Path: /home/hadoop/project/flume/avro
           [hadoop@bigdata13 avro]$ vim nc-mem-avro.conf

          a1.sources = r1
          a1.sinks = k1
          a1.channels = c1
          
          a1.sources.r1.type = netcat
          a1.sources.r1.bind = localhost
          a1.sources.r1.port = 1111
          
          a1.channels.c1.type = memory
          
          a1.sinks.k1.type = avro
          a1.sinks.k1.hostname=bigdata13
          a1.sinks.k1.port=2222
          
          a1.sources.r1.channels = c1
          a1.sinks.k1.channel = c1
        • agent2:
           Configure the avro-mem-logger.conf file
           Path: /home/hadoop/project/flume/avro
           [hadoop@bigdata13 avro]$ vim avro-mem-logger.conf
           (the demo below uses the logger variant; an avro-mem-hdfs sketch follows this config)

          a1.sources = r1
          a1.sinks = k1
          a1.channels = c1
          
          a1.sources.r1.type = avro
          a1.sources.r1.bind = bigdata13
          a1.sources.r1.port = 2222
          a1.channels.c1.type = memory
          a1.sinks.k1.type = logger
          a1.sources.r1.channels = c1
          a1.sinks.k1.channel = c1
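
          The notes name an avro-mem-hdfs variant but never show it. A minimal sketch, reusing the avro source above and the HDFS sink settings from the earlier cases (the target path /flume/avro is an assumption):

          # avro-mem-hdfs.conf: same avro source on 2222, but write to HDFS
          a1.sources = r1
          a1.sinks = k1
          a1.channels = c1

          a1.sources.r1.type = avro
          a1.sources.r1.bind = bigdata13
          a1.sources.r1.port = 2222

          a1.channels.c1.type = memory

          a1.sinks.k1.type = hdfs
          a1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/flume/avro/
          a1.sinks.k1.hdfs.fileType = DataStream
          a1.sinks.k1.hdfs.writeFormat = Text

          a1.sources.r1.channels = c1
          a1.sinks.k1.channel = c1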

      • 2. Start Flume (three terminals)

        • Start the agent listening on port 2222 first, then the one on port 1111, and finally the data source (telnet)
          [start from the downstream end back toward the source]

        • Start flume 1

          flume-ng agent \
          --name a1 \
          --conf ${FLUME_HOME}/conf \
          --conf-file /home/hadoop/project/flume/avro/avro-mem-logger.conf \
          -Dflume.root.logger=info,console
          
        • Start flume 2

          flume-ng agent \
          --name a1 \
          --conf ${FLUME_HOME}/conf \
          --conf-file /home/hadoop/project/flume/avro/nc-mem-avro.conf \
          -Dflume.root.logger=info,console
          
        • Start the third terminal
          connect locally

          telnet localhost 1111

           ① top-left: nc-mem-avro.conf            port 1111
           ② bottom-left: avro-mem-logger.conf     port 2222
           ③ right: telnet localhost 1111

           whatever is typed in ③ shows up in ②

  • 5. Compressed storage: bzip2

    • Case:
              collect log data [self-generated, 3k lines] : java or shell
              land it on HDFS with bzip2 compression
      • Pipeline
                 source: exec or taildir
                 channel: mem or file
                 sink: hdfs => bzip2
      • 1. Data
           Path:
        for x in {1..3000}
        do 
          echo "dl2262,${x}" >> /home/hadoop/tmp/codec.log
          sleep 0.1s
        done  
      • 2. agent
          Configure the taildir-mem-hdfs-bzip2.conf file
          Path: /home/hadoop/project/flume
          [hadoop@bigdata13 flume]$ vim taildir-mem-hdfs-bzip2.conf
        agent1.sources = r1
        agent1.sinks = k1
        agent1.channels = c1
        
        agent1.sources.r1.type = TAILDIR
        agent1.sources.r1.filegroups = f1
        agent1.sources.r1.filegroups.f1=/home/hadoop/tmp/codec.log
        
        agent1.channels.c1.type = memory
        
        agent1.sinks.k1.type = hdfs
        agent1.sinks.k1.hdfs.path = hdfs://bigdata13:9000/flume/bzip2/
        agent1.sinks.k1.hdfs.fileType=CompressedStream
        agent1.sinks.k1.hdfs.writeFormat=Text
        # file name prefix/suffix
        agent1.sinks.k1.hdfs.filePrefix=events
        agent1.sinks.k1.hdfs.fileSuffix=.bz2
        agent1.sinks.k1.hdfs.codeC=bzip2
        
        # file roll policy
        agent1.sinks.k1.hdfs.rollInterval=60
        agent1.sinks.k1.hdfs.rollSize=134217728
        agent1.sinks.k1.hdfs.rollCount=100
        
        agent1.sources.r1.channels = c1
        agent1.sinks.k1.channel = c1
      • 3. Start Flume
           Path: /home/hadoop/app/flume/conf
        flume-ng agent \
        --name agent1 \
        --conf ${FLUME_HOME}/conf \
        --conf-file /home/hadoop/project/flume/taildir-mem-hdfs-bzip2.conf \
        -Dflume.root.logger=info,console
        

        The content on HDFS is compressed, so it cannot be read directly


        View it with: [hadoop@bigdata13 tmp]$ hadoop fs -text /flume/bzip2/events.1670985423512.bz2
         /flume/bzip2 : the directory
         events.1670985423512.bz2 : the file name
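
         To see what the compression bought, compare the stored size with the raw log (generic commands; exact numbers depend on your data):

         hadoop fs -du -h /flume/bzip2/
         ls -lh /home/hadoop/tmp/codec.log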
