Business requirement
Use Spark with the data warehouse, and segment the key title and tag fields of the warehouse's article and video tables into words.
Tokenization approach
Tokenization is done with the HanLP toolkit. Below is the tokenization logic as first implemented:
/**
 * Tokenization function (initial version)
 *
 * @param elements the partition's video records
 * @return records paired with their segmented words
 */
def separate(elements: Iterator[VideoInfo]): Iterator[VideoWords] = {
  var termList: List[Term] = List()
  var infoWithWordsList: List[VideoWords] = List()
  while (elements.hasNext) {
    var row = elements.next()
    var info = row.info
    if (info != null && info.length > 0) {
      info = info.replaceAll("\\p{P}|\\p{Z}", "") // Unicode regex: strip punctuation and separators/whitespace
    }
    HanLP.Config.ShowTermNature = false // drop part-of-speech tags
    var terms = CoreStopWordDictionary.apply(StandardTokenizer.segment(info)) // filter out stop words and punctuation
    import scala.collection.JavaConversions._
    for (term <- terms) {
      if (term.word.length() > 1) {
        termList = term +: termList // append each token to the term list
      }
    }
    var words = termList.toString().replace("List(", "").replace(")", "")
    var infoWithWords = VideoWords(row.v_id, row.v_title, row.v_fc, words)
    infoWithWordsList = infoWithWords +: infoWithWordsList
  }
  infoWithWordsList.iterator
}
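As a quick illustration of the `\p{P}|\p{Z}` regex used above (a standalone sketch with made-up input, separate from the job itself): `\p{P}` matches Unicode punctuation in any script, and `\p{Z}` matches separator characters such as spaces.

```scala
object RegexDemo {
  def main(args: Array[String]): Unit = {
    // Strips ASCII and CJK punctuation ("," "!" "。") plus spaces, keeping letters intact
    val cleaned = "Hello, 世界! Spark 分词。".replaceAll("\\p{P}|\\p{Z}", "")
    println(cleaned) // Hello世界Spark分词
  }
}
```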
Error when packaging and running
I first tested locally with 100 rows exported from the warehouse's video table, and the job succeeded. But when I exported the whole table (8 million+ rows), both the master and the workers reported OOM errors. A partial excerpt of the error:
21/12/10 09:26:17 ERROR Executor: Exception in task 112.0 in stage 0.0 (TID 112)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:514)
at java.lang.StringBuffer.append(StringBuffer.java:352)
at java.util.regex.Matcher.appendTail(Matcher.java:911)
at java.util.regex.Matcher.replaceAll(Matcher.java:958)
at java.lang.String.replace(String.java:2240)
at com.bigdata.reco_cal.video.TFIDFModel.VideoCalTFModel$.separate(VideoCalTFModel.scala:93)
at com.bigdata.reco_cal.video.TFIDFModel.VideoCalTFModel$$anonfun$1.apply(VideoCalTFModel.scala:113)
at com.bigdata.reco_cal.video.TFIDFModel.VideoCalTFModel$$anonfun$1.apply(VideoCalTFModel.scala:113)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
21/12/10 09:26:17 INFO ShutdownHookManager: Shutdown hook called
21/12/10 09:26:17 INFO Executor: Not reporting error to driver during JVM shutdown.
Adding resources:
When I first saw the OOM I assumed Spark simply lacked resources, so I kept adding more, starting from this configuration:
spark.driver.memory = 4g
spark.executor.memory = 6g
plus off-heap overhead, via these two settings:
spark.driver.memoryOverhead = 1g
spark.executor.memoryOverhead = 1g
and, after repeated increases, I ended up at this configuration:
On-heap memory, via these two settings:
spark.driver.memory = 36g
spark.executor.memory = 24g
Off-heap overhead, via these two settings:
spark.driver.memoryOverhead = 12g
spark.executor.memoryOverhead = 8g
Of course it still failed. Then I reviewed the code and found the code itself was to blame~~😓😓😓
Code self-review
The stack trace makes it clear that the OOM happens in the replace call inside the tokenization function, and the string replace is invoked on is the one built from termList. The conclusion: termList is declared outside the while loop, so it acts as a partition-wide accumulator that keeps appending the tokens of every record's info field (the multi-part video description). A single video's description, title, and related text in the warehouse can be very long, and because termList is never cleared, each row's words string also contains the tokens of every earlier row in the partition, so memory use and string-building work keep growing until the job dies with OOM.
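The accumulation effect can be reproduced without Spark or HanLP at all (a minimal sketch with made-up data): with n rows, the string built for the last row is O(n) long, so the total string-building work across a partition grows roughly quadratically.

```scala
object AccumulationDemo {
  def main(args: Array[String]): Unit = {
    val rows = Seq.fill(4)("token" * 10) // pretend each row yields some tokens
    var acc: List[String] = List()       // partition-wide list, as in the buggy version
    var totalChars = 0L
    for (r <- rows) {
      acc = r +: acc                     // never cleared between rows
      totalChars += acc.toString.length  // the string built per row keeps growing
    }
    // Per-row string length grows linearly with the row index, so the total
    // is O(n^2); at 8M+ long rows this exhausts the heap well before the end.
    println(totalChars)
  }
}
```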
Code optimization:
/**
 * Tokenization function (fixed version)
 *
 * @param elements the partition's video records
 * @return records paired with their segmented words
 */
def separate(elements: Iterator[VideoInfo]): Iterator[VideoWords] = {
  var infoWithWordsList: List[VideoWords] = List()
  while (elements.hasNext) {
    var row = elements.next()
    var info = row.info
    if (info != null && info.length > 0) {
      info = info.replaceAll("\\p{P}|\\p{Z}", "") // Unicode regex: strip punctuation and separators/whitespace
    }
    HanLP.Config.ShowTermNature = false // drop part-of-speech tags
    var terms = CoreStopWordDictionary.apply(StandardTokenizer.segment(info)) // filter out stop words and punctuation
    import scala.collection.JavaConversions._
    var termList: List[Term] = List() // re-created for every record, so tokens no longer accumulate across rows
    for (term <- terms) {
      if (term.word.length() > 1) {
        termList = term +: termList // append each token to the term list
      }
    }
    var words = termList.toString().replace("List(", "").replace(")", "")
    var infoWithWords = VideoWords(row.v_id, row.v_title, row.v_fc, words)
    infoWithWordsList = infoWithWords +: infoWithWordsList
  }
  infoWithWordsList.iterator
}
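As a further cleanup (my suggestion, not part of the original fix), the same per-partition logic can be written with map and mkString, which drops the mutable lists, avoids the `toString().replace(...)` string surgery, guards against a null info field, and returns a lazy iterator instead of materializing the whole result list:

```scala
def separate(elements: Iterator[VideoInfo]): Iterator[VideoWords] = {
  import scala.collection.JavaConverters._
  HanLP.Config.ShowTermNature = false // drop part-of-speech tags, once per partition
  elements.map { row =>
    // Treat a null description as empty, then strip punctuation and separators
    val info = Option(row.info).getOrElse("").replaceAll("\\p{P}|\\p{Z}", "")
    val words = CoreStopWordDictionary.apply(StandardTokenizer.segment(info))
      .asScala
      .collect { case t if t.word.length > 1 => t.word } // keep multi-character tokens only
      .mkString(",")
    VideoWords(row.v_id, row.v_title, row.v_fc, words)
  }
}
```

Because map consumes the input iterator lazily, each row's tokens become garbage-collectable as soon as its VideoWords is emitted, which keeps per-partition memory flat regardless of row count.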
Problem solved
Thank you~