The same Spark job starts throwing errors as the data volume grows

Business requirement

Integrate Spark with the data warehouse and tokenize the key title and tag fields of the articles and videos stored there.

Tokenization method

Tokenization is done with the HanLP toolkit. Below is the tokenization logic as first implemented:

/**
   * Tokenization function: runs per partition over VideoInfo rows
   *
   * @param elements iterator of input rows
   * @return iterator of rows with the segmented words attached
   */
  def separate(elements: Iterator[VideoInfo]): Iterator[VideoWords] = {

    var termList: List[Term] = List() // declared outside the while loop (this turns out to be the problem, see below)
    var infoWithWordsList: List[VideoWords] = List()

    while (elements.hasNext) {
      var row = elements.next()

      var info = row.info

      if (info != null && info.length > 0) {
        info = info.replaceAll("\\p{P}|\\p{Z}", "") // filter punctuation and whitespace with a Unicode regex
      }

      HanLP.Config.ShowTermNature = false // hide part-of-speech tags

      var terms = CoreStopWordDictionary.apply(StandardTokenizer.segment(info)) // drop stop words and punctuation
      import scala.collection.JavaConversions._

      for (term <- terms) {
        if (term.word.length() > 1) {
          termList = term +: termList // prepend each term to the term list
        }
      }

      var words = termList.toString().replace("List(", "").replace(")", "")
      var infoWithWords = VideoWords(row.v_id, row.v_title, row.v_fc, words)

      infoWithWordsList = infoWithWords +: infoWithWordsList
    }
    infoWithWordsList.iterator
  }

Error when packaged and run

At first I tested locally with 100 rows exported from the warehouse's video table, and the job succeeded. But when I exported the full table (8 million+ rows), both the master and the workers reported OOM errors. A partial excerpt of the error:

21/12/10 09:26:17 ERROR Executor: Exception in task 112.0 in stage 0.0 (TID 112)
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:514)
	at java.lang.StringBuffer.append(StringBuffer.java:352)
	at java.util.regex.Matcher.appendTail(Matcher.java:911)
	at java.util.regex.Matcher.replaceAll(Matcher.java:958)
	at java.lang.String.replace(String.java:2240)
	at com.bigdata.reco_cal.video.TFIDFModel.VideoCalTFModel$.separate(VideoCalTFModel.scala:93)
	at com.bigdata.reco_cal.video.TFIDFModel.VideoCalTFModel$$anonfun$1.apply(VideoCalTFModel.scala:113)
	at com.bigdata.reco_cal.video.TFIDFModel.VideoCalTFModel$$anonfun$1.apply(VideoCalTFModel.scala:113)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
21/12/10 09:26:17 INFO ShutdownHookManager: Shutdown hook called
21/12/10 09:26:17 INFO Executor: Not reporting error to driver during JVM shutdown.

Adding resources:

When I first saw the OOM I assumed the Spark job simply lacked resources, so I kept piling them on, starting from this configuration:

spark.driver.memory = 4g
spark.executor.memory = 6g

Increase off-heap memory with these two settings:
spark.driver.memoryOverhead = 1g
spark.executor.memoryOverhead = 1g

and eventually working my way up to this configuration:

Increase heap memory with these two settings:
spark.driver.memory = 36g
spark.executor.memory = 24g

Increase off-heap memory with these two settings:
spark.driver.memoryOverhead = 12g
spark.executor.memoryOverhead = 8g
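
For reference, a minimal sketch of where such settings can live (the app name mirrors the class in the stack trace; the rest is my own assumption). Executor-side values can be applied when building the session, but spark.driver.memory only takes effect if set before the driver JVM starts, so in client mode the driver values must go on the spark-submit command line:

import org.apache.spark.sql.SparkSession

// Executor settings applied in code work because executors launch after
// the session is created; driver memory cannot be raised here, since the
// driver JVM is already running (pass --conf spark.driver.memory=36g to
// spark-submit instead).
val spark = SparkSession.builder()
  .appName("VideoCalTFModel")
  .config("spark.executor.memory", "24g")
  .config("spark.executor.memoryOverhead", "8g")
  .getOrCreate()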

Of course it still failed, and reviewing the code revealed the code itself was to blame~~ 😓😓😓

Reviewing the code

The stack trace makes it clear that the OOM happens in the replace call inside the tokenization method, and the object replace is called on is the string rendered from termList.
The conclusion: termList is declared outside the while loop, so it effectively acts as a global variable that keeps accumulating the terms from every record's info field (the multi-dimensional video description). Since a single video's synopsis, title, and related text in the warehouse can be extremely long, the list, and the string that toString materializes from it on every iteration, keep growing until the job dies with an OOM.
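
A tiny standalone sketch (my own illustration, not code from the job) of why this pattern blows up: because the accumulator survives across iterations, every toString re-renders the entire history, so both memory use and the temporary strings grow with every row processed.

// The accumulator outlives the loop body, exactly like termList above.
var acc: List[String] = List()
for (i <- 1 to 5) {
  acc = s"row-$i-terms" +: acc
  val snapshot = acc.toString() // re-renders everything seen so far
  println(s"iteration $i: snapshot length = ${snapshot.length}")
}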

Optimized code:
/**
   * Tokenization function: runs per partition over VideoInfo rows
   *
   * @param elements iterator of input rows
   * @return iterator of rows with the segmented words attached
   */
  def separate(elements: Iterator[VideoInfo]): Iterator[VideoWords] = {

    var infoWithWordsList: List[VideoWords] = List()

    while (elements.hasNext) {
      var row = elements.next()

      var info = row.info

      if (info != null && info.length > 0) {
        info = info.replaceAll("\\p{P}|\\p{Z}", "") // filter punctuation and whitespace with a Unicode regex
      }

      HanLP.Config.ShowTermNature = false // hide part-of-speech tags

      var terms = CoreStopWordDictionary.apply(StandardTokenizer.segment(info)) // drop stop words and punctuation
      import scala.collection.JavaConversions._
      var termList: List[Term] = List() // now scoped to a single record, so it no longer accumulates across rows
      for (term <- terms) {
        if (term.word.length() > 1) {
          termList = term +: termList // prepend each term to the term list
        }
      }

      var words = termList.toString().replace("List(", "").replace(")", "")
      var infoWithWords = VideoWords(row.v_id, row.v_title, row.v_fc, words)

      infoWithWordsList = infoWithWords +: infoWithWordsList
    }
    infoWithWordsList.iterator
  }
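
As a further refinement (my own sketch, not part of the original fix; the HanLP calls and the VideoInfo/VideoWords case classes mirror the post's usage): mapping the input iterator lazily also avoids materializing infoWithWordsList for the whole partition, so memory stays flat no matter how many rows a partition holds.

import scala.collection.JavaConverters._
import com.hankcs.hanlp.HanLP
import com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary
import com.hankcs.hanlp.tokenizer.StandardTokenizer

// Sketch: process rows one at a time; neither the term list nor the
// per-partition result list is ever held in memory all at once.
def separateLazy(elements: Iterator[VideoInfo]): Iterator[VideoWords] = {
  HanLP.Config.ShowTermNature = false // hide part-of-speech tags
  elements.map { row =>
    val info = Option(row.info).getOrElse("").replaceAll("\\p{P}|\\p{Z}", "")
    val terms = CoreStopWordDictionary.apply(StandardTokenizer.segment(info))
    val words = terms.asScala
      .collect { case t if t.word.length > 1 => t.word }
      .mkString(", ")
    VideoWords(row.v_id, row.v_title, row.v_fc, words)
  }
}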

Problem solved

Thank you~
