flume spoolDirectorySource: a custom Deserializer for multi-line reads; a thorough fix for "File has changed size since being read"; sinking the collected data to Hive

spoolDirectorySource

1. Fix "File has changed size since being read" and "File has been modified since being read"
(based on the article 《flume spoolDirectorySource中的 File has been modified since being read与File has changed size since being read错误》 by 骑小象去远方; part of its logic is optimized here, because the original blocked for 1 second on every file, which slowed collection down considerably)
2. A custom Deserializer that turns semi-structured files into row/column events before sending them downstream
3. Using the includePattern and ignorePattern regexes together
4. Storing the data collected by Flume in Hive

1. Flume collection configuration files

The collection agent

# Format the data and forward it to the next agent
# Name the components on this agent
a2.sources = r1
a2.sinks = k1 k2
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = spooldir
a2.sources.r1.channels = c1
a2.sources.r1.spoolDir = /data/ftp_root/tmc/mestestftp
# Custom deserializer developed for our special format; the default is LineDeserializer
a2.sources.r1.deserializer = cn.xxxxx.dmp.tmc.flume.deserializer.TmcTxtDeserializer$Builder
# Maximum line length per event
a2.sources.r1.deserializer.maxLineLength = 1024000
# Processing order: oldest, youngest or random; random is fastest, so switch to random when files pile up
#a2.sources.r1.consumeOrder = oldest
a2.sources.r1.consumeOrder = random
# a1.sources.r1.deserializer.delimiter = \t
a2.sources.r1.fileHeader = true
a2.sources.r1.fileHeaderKey = fishfile
a2.sources.r1.basenameHeader = true
a2.sources.r1.basenameHeaderKey = fishbasenam
a2.sources.r1.inputCharset = GBK
# Ignore files ending in .tmp
a2.sources.r1.ignorePattern =  ([^ ]*\.tmp$)
# Only pick up files ending in .rpt / .RPT
a2.sources.r1.includePattern = ([^ ]*\.(rpt|RPT)$)
# Options: never, immediate
a2.sources.r1.deletePolicy = immediate
# IGNORE or REPLACE: how to handle character-decoding errors in a file
a2.sources.r1.decodeErrorPolicy = IGNORE
# Recurse into subdirectories
a2.sources.r1.recursiveDirectorySearch = true
# Increase the delay to 5 seconds so that larger files do not fail with read timeouts
a2.sources.r1.pollDelay = 5000
# Number of events transferred per batch
a2.sources.r1.batchSize = 4000

# Custom interceptor: extracts the process-step name from the file
a2.sources.r1.interceptors = i1
# The type cannot be a short alias like uuid; the fully-qualified class name is required, otherwise the class is not found
a2.sources.r1.interceptors.i1.type = cn.xxxxx.dmp.tmc.flume.interceptor.ProcessNameInterceptor$Builder
# If the process-name header already exists, keep it; the ten-minute-level time value is used as the partition
a2.sources.r1.interceptors.i1.preserveExisting = true
a2.sources.r1.interceptors.i1.processHeader = tmcProcessName
a2.sources.r1.interceptors.i1.timeHeader = timeNoSecend

# Describe the sinks: two avro sinks share the load
# a1.sinks.k1.type = logger
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = flume_slave2
# The upstream agent's avro sink connects to the downstream host over RPC
a2.sinks.k1.port = 4444

a2.sinks.k2.type = avro
a2.sinks.k2.hostname = flume_slave
# The upstream agent's avro sink connects to the downstream host over RPC
a2.sinks.k2.port = 4444

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 100000
a2.channels.c1.transactionCapacity = 5120

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
a2.sinks.k2.channel = c1
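
For reference, a deserializer plugged in through a2.sources.r1.deserializer has to implement Flume's EventDeserializer interface and expose a nested Builder. The following is only a minimal sketch of that shape, not the real TmcTxtDeserializer: how the semi-structured .rpt content is actually split into tab-separated rows is specific to the author's files, so the record boundary assumed here (a blank line) and the output charset are assumptions; only the maxLineLength key matches the config above.

package cn.xxxxx.dmp.tmc.flume.deserializer;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.serialization.EventDeserializer;
import org.apache.flume.serialization.ResettableInputStream;

// Sketch: read the physical lines that make up one logical record and emit
// them as a single tab-separated event body.
public class TmcTxtDeserializer implements EventDeserializer {

  private final ResettableInputStream in;
  private final int maxLineLength;
  private volatile boolean isOpen;

  TmcTxtDeserializer(Context context, ResettableInputStream in) {
    this.in = in;
    this.maxLineLength = context.getInteger("maxLineLength", 1024000);
    this.isOpen = true;
  }

  @Override
  public Event readEvent() throws IOException {
    String record = readLogicalRecord();
    return record == null ? null : EventBuilder.withBody(record, StandardCharsets.UTF_8);
  }

  @Override
  public List<Event> readEvents(int numEvents) throws IOException {
    List<Event> events = new ArrayList<>(numEvents);
    for (int i = 0; i < numEvents; i++) {
      Event e = readEvent();
      if (e == null) break;
      events.add(e);
    }
    return events;
  }

  // Collect lines until a blank line (assumed record terminator) and join them with \t.
  private String readLogicalRecord() throws IOException {
    List<String> lines = new ArrayList<>();
    String line;
    while ((line = readLine()) != null) {
      if (line.trim().isEmpty()) {
        if (!lines.isEmpty()) break;   // end of one logical record
        continue;                      // skip leading blank lines
      }
      lines.add(line);
    }
    return lines.isEmpty() ? null : String.join("\t", lines);
  }

  // Read one physical line from the resettable stream, dropping \r and truncating at maxLineLength.
  private String readLine() throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.readChar()) != -1) {
      if (c == '\n') {
        return sb.toString();
      }
      if (c != '\r' && sb.length() < maxLineLength) {
        sb.append((char) c);
      }
    }
    return sb.length() == 0 ? null : sb.toString();
  }

  @Override
  public void mark() throws IOException { in.mark(); }

  @Override
  public void reset() throws IOException { in.reset(); }

  @Override
  public void close() throws IOException {
    if (isOpen) {
      reset();
      in.close();
      isOpen = false;
    }
  }

  // Referenced from the config as ...TmcTxtDeserializer$Builder
  public static class Builder implements EventDeserializer.Builder {
    @Override
    public EventDeserializer build(Context context, ResettableInputStream in) {
      return new TmcTxtDeserializer(context, in);
    }
  }
}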

v1.0: the agent that writes to Hive


# 01 specify agent,source,sink,channel
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# 02 avro source, listening on port 4444
a1.sources.r1.type = avro
# The downstream avro source binds to this host; the port must match the value configured on the upstream agent
a1.sources.r1.bind = flume_slave
a1.sources.r1.port = 4444

# 03 hive sink
# Describe the sink
# a1.sinks.k1.type = logger
a1.sinks.k1.type = hive
# thrift://dev-data-center-3:9083
a1.sinks.k1.hive.metastore = thrift://bigdata-online-2:9083
# a1.sinks.k1.hive.database = ods
# a1.sinks.k1.hive.table = ods_pre_tmc_test_ba
a1.sinks.k1.hive.database = ods
a1.sinks.k1.hive.table = ods_pre_tmc_test_detail_ba_10min
# Partition by process-step name and the ten-minute time value
a1.sinks.k1.hive.partition = %{tmcProcessName},%{timeNoSecend}
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.callTimeout = 30000
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.batchSize = 4096
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.maxOpenConnections = 300
a1.sinks.k1.serializer.delimiter = "\t"
# a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames =serial_number,test_site,test_bench,slot_number,test_mode,lot_type,operator,start_time,engine_version,gui_version,test_service,scenario_version,jig_sn,rfdevice,powersupply,otherdevice,test_time,service_error,error_code,error_index,code,measure,result,unit,status,tol_min,tol_max,file_path



# Use a file channel, which buffers events on local disk
# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data1/flume_temp/fchannel/checkpoint
a1.channels.c1.dataDirs = /data2/flume_temp/fchannel/data
a1.channels.c1.capacity = 200000000
a1.channels.c1.keep-alive = 30
a1.channels.c1.write-timeout = 30
a1.channels.c1.checkpoint-timeout = 600

# a1.channels.c1.type = memory
# a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5120

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
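
For v1.0 to work, the target table has to satisfy the Hive streaming requirements the Hive sink relies on: stored as ORC, bucketed, transactions enabled, and partitioned by the two values supplied in hive.partition. A hypothetical DDL of that shape (partition column names, column types and the bucket count are assumptions; the column list is abbreviated, the full list being the serializer.fieldnames above):

CREATE TABLE ods.ods_pre_tmc_test_detail_ba_10min (
  serial_number string,
  test_site     string,
  test_bench    string,
  -- ... remaining serializer.fieldnames ...
  tol_max       string,
  file_path     string
)
PARTITIONED BY (tmcprocessname string, timenosecend string)
CLUSTERED BY (serial_number) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');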

v2.0: the sink is changed to HDFS (v1.0 wrote into a Hive transactional ORC table, which later proved unstable: transactional tables break easily and throughput was low). Only the sink part of the config above needs to be replaced with the following:

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://ns/user/hive/warehouse/ods.db/ods_pre_tmc_test_detail_10min/p_minute=%{timeNoSecend}
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 102400000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.txnEventMax = 1000
a1.sinks.k1.hdfs.callTimeout = 60000
a1.sinks.k1.hdfs.appendTimeout = 60000
a1.sinks.k1.hdfs.filePrefix = flume
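
The HDFS sink above writes plain tab-delimited text straight into the table's warehouse directory, so the v2.0 table is just an ordinary text table partitioned by p_minute. Roughly like this (again hypothetical; p_minute is declared bigint here so the unquoted ADD PARTITION in the script below stays valid, and the column list is abbreviated, in the same order as the fields in the event body):

CREATE TABLE ods.ods_pre_tmc_test_detail_10min (
  serial_number string,
  test_site     string,
  -- ... remaining fields ...
  file_path     string
)
PARTITIONED BY (p_minute bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;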

Note: after switching to landing the data on HDFS, the corresponding partitions must also be added in Hive on a schedule, otherwise the new partition data cannot be queried from the table. A partition here covers 10 minutes, so crontab runs the following script every 10 minutes:

#!/usr/bin/bash
# Current time as yyyyMMddHHmmss
l_date=$(date "+%Y%m%d%H%M%S")
# First 11 characters = yyyyMMddHHm, i.e. the 10-minute partition value
pmin=${l_date:0:11}
beeline -u jdbc:hive2:// --verbose=true -e "ALTER TABLE ods.ods_pre_tmc_test_detail_10min ADD PARTITION (p_minute=$pmin)"
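
Assuming the script is saved as /opt/scripts/add_tmc_partition.sh (path hypothetical), the matching crontab entry is:

*/10 * * * * /bin/bash /opt/scripts/add_tmc_partition.sh >> /tmp/add_tmc_partition.log 2>&1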

2. Code details for thoroughly fixing "File has changed size since being read" and "File has been modified since being read"

Cause of the error, quoting the original author 骑小象去远方:

"We never modified the file at all, so why is this error still thrown? After looking at the code I found that this thread runs every 500 ms. When we copy a somewhat larger file, the copy has not finished within 500 ms, so this error appears. Of course Flume was designed with 500 ms because the assumption is that everyone transfers very small files, writing a small log file every few minutes or seconds, in which case the problem never shows up."

To add to that: while Flume is processing a file, the file is still being transferred; when Flume finishes and finds the file differs from what it looked like before processing, that round of collection is necessarily wrong, so the exception is thrown.

Solution:
1. Modify one class inside flume-ng-core-1.9.0.jar. First download the source from the official site, then edit the ReliableSpoolingFileEventReader class and add a call to a "file check" method inside getNextFile() (see the comment in the code below):

private Optional<FileInfo> getNextFile() {
  List<File> candidateFiles = Collections.emptyList();

  if (consumeOrder != ConsumeOrder.RANDOM ||
      candidateFileIter == null ||
      !candidateFileIter.hasNext()) {
    candidateFiles = getCandidateFiles(spoolDirectory.toPath());
    listFilesCount++;
    candidateFileIter = candidateFiles.iterator();
  }

  if (!candidateFileIter.hasNext()) {
    // No matching file in spooling directory.
    return Optional.absent();
  }

  File selectedFile = candidateFileIter.next();
  if (consumeOrder == ConsumeOrder.RANDOM) {
    // Selected file is random (random order is fastest).
    // Fix for the "file has been modified" error on large files:
    // wait until the file copy has finished before opening it.
    this.checkFileCpIsOver(selectedFile);
    return openFile(selectedFile);
  } else if (consumeOrder == ConsumeOrder.YOUNGEST) {
    for (File candidateFile : candidateFiles) {
      long compare = selectedFile.lastModified() -
          candidateFile.lastModified();
      if (compare == 0) {
        // ts is the same: pick the smallest lexicographically.
        selectedFile = smallerLexicographical(selectedFile, candidateFile);
      } else if (compare < 0) {
        // candidate is younger (cand-ts > selec-ts)
        selectedFile = candidateFile;
      }
    }
  } else {
    // default order is OLDEST
    for (File candidateFile : candidateFiles) {
      long compare = selectedFile.lastModified() -
          candidateFile.lastModified();
      if (compare == 0) {
        // ts is the same: pick the smallest lexicographically.
        selectedFile = smallerLexicographical(selectedFile, candidateFile);
      } else if (compare > 0) {
        // candidate is older (cand-ts < selec-ts).
        selectedFile = candidateFile;
      }
    }
  }

  return openFile(selectedFile);
}
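
The checkFileCpIsOver method called above is not shown in this excerpt. A minimal sketch of such a check, assuming the intent described earlier (only wait when the file still looks like it is being copied, rather than sleeping a fixed 1 second for every file as the original fix did); the age threshold and poll interval below are assumptions, not values from the original article:

// Assumed thresholds, not values from the original article.
private static final long STABLE_AGE_MS = 2000L;
private static final long POLL_INTERVAL_MS = 200L;

private void checkFileCpIsOver(File file) {
  // Fast path: a file that has not been touched recently is treated as fully
  // copied, so most files are opened without any waiting.
  if (System.currentTimeMillis() - file.lastModified() > STABLE_AGE_MS) {
    return;
  }
  // Otherwise poll until size and modification time are identical across two polls.
  long lastLen = -1L;
  long lastMod = -1L;
  while (true) {
    long len = file.length();
    long mod = file.lastModified();
    if (len == lastLen && mod == lastMod) {
      return;
    }
    lastLen = len;
    lastMod = mod;
    try {
      Thread.sleep(POLL_INTERVAL_MS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return;
    }
  }
}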