Ignoring Corrupt Files When Spark Reads gz Files

Background

The business requirement is to parse a large number of gzip-compressed files stored on HDFS. The job is implemented with Spark Core; the read boils down to:

sparkContext.textFile(srcPath);

At runtime the job failed with Caused by: java.io.EOFException: Unexpected end of input stream, which made the Spark application exit before parsing could finish.

The root cause is a single malformed gz file among the inputs: it cannot be decompressed, so the read throws and aborts the whole job.
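
For reference, a minimal self-contained version of the driver, assuming the Java RDD API (the class name, application name, and srcPath below are hypothetical placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ParseGzFiles {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("parse-gz");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        String srcPath = "/data/input/*.gz";  // hypothetical input path
        // A single corrupt .gz anywhere under srcPath fails the whole job
        // unless corrupt files are handled (see the solutions below).
        long lines = sparkContext.textFile(srcPath).count();
        System.out.println("parsed " + lines + " lines");
        sparkContext.stop();
    }
}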

Solutions

  • Option 1: before running the parsing job, run a program or script that checks every file and moves the broken ones out of the input directory (a sketch of such a check follows this list).

    This is how we handled the same problem when the parser was implemented in Flink, which has no built-in way to skip an individual file that fails to parse.

  • Option 2: set a configuration flag that tells Spark to ignore files that fail to parse:

    sparkConf.set("spark.files.ignoreCorruptFiles", "true");
    

    With this flag set, Spark catches the error internally when it hits a corrupt file, and our job keeps running (a fuller driver setup also follows this list).
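
A minimal sketch of the pre-check from option 1, assuming the Hadoop FileSystem API: it fully decompresses each .gz file once and quarantines any file whose stream is malformed. The class name and both directory paths are hypothetical placeholders, not part of the original job.

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzPreCheck {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path srcDir = new Path("/data/input");    // hypothetical source directory
        Path badDir = new Path("/data/corrupt");  // hypothetical quarantine directory
        fs.mkdirs(badDir);

        for (FileStatus status : fs.listStatus(srcDir)) {
            if (!status.getPath().getName().endsWith(".gz")) {
                continue;
            }
            if (!isReadableGzip(fs, status.getPath())) {
                // Move the broken file out of the way before the Spark job runs.
                fs.rename(status.getPath(), new Path(badDir, status.getPath().getName()));
            }
        }
        fs.close();
    }

    // Returns false if the stream ends unexpectedly (the EOFException case above)
    // or the gzip data is otherwise malformed.
    private static boolean isReadableGzip(FileSystem fs, Path file) {
        byte[] buf = new byte[8192];
        try (InputStream in = new GZIPInputStream(fs.open(file))) {
            while (in.read(buf) != -1) {
                // Drain the stream; we only care whether decompression succeeds.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}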
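
For option 2, the driver setup from the background sketch changes by one line; the same property can also be passed at submit time via spark-submit --conf spark.files.ignoreCorruptFiles=true. Note that it has to be set on the SparkConf before the SparkContext is created, because HadoopRDD reads it from the context's configuration (see the appendix below).

SparkConf sparkConf = new SparkConf()
        .setAppName("parse-gz")
        // Skip corrupt files instead of failing the whole job.
        .set("spark.files.ignoreCorruptFiles", "true");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);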

Appendix: How Spark Skips Corrupt Files

// org/apache/spark/rdd/HadoopRDD.scala
private val ignoreCorruptFiles = sparkContext.conf.get(IGNORE_CORRUPT_FILES)

override def getNext(): (K, V) = {
    try {
        finished = !reader.next(key, value)
    } catch {
        // With spark.files.ignoreCorruptFiles enabled, the error is caught
        // here and the rest of the job keeps running.
        case e: IOException if ignoreCorruptFiles =>
            logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
            finished = true
    }
    if (!finished) {
        inputMetrics.incRecordsRead(1)
    }
    if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
        updateBytesRead()
    }
    (key, value)
}
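
Note that the guard matches IOException and its subclasses; java.io.EOFException extends IOException, so the failure shown in the stack trace below is caught here, logged as a warning, and the remainder of the corrupt split is skipped instead of failing the task.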

Appendix: Full Stack Trace

Caused by: java.io.EOFException: Unexpected end of input stream
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:165)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
	at java.io.InputStream.read(InputStream.java:101)
	at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
	at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:255)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:277)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:214)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1094)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1085)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1020)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1085)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:811)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
