1. Error: resolved attribute(s) missing
The following code reproduces the error:
var multiInsuCountDf = multiInsuDf.select("req_id", "main_flag", "name", "idn")
multiInsuCountDf = multiInsuCountDf.groupBy("req_id", "main_flag", "name", "idn").count()
val multiInsuDfResult = multiInsuDf.join(multiInsuCountDf, Seq("req_id", "main_flag", "name", "idn"), "left")
Solution: rename the columns of multiInsuCountDf (even if the new names are identical to the old ones).
Likely cause (not fully verified): multiInsuCountDf is derived from multiInsuDf, so both DataFrames share the same underlying column references (attribute IDs). When the two are joined back together, the analyzer cannot tell the two sides apart. The renaming below most likely works because each withColumnRenamed call produces a fresh column reference.
// Renaming each column to its own name forces Spark to assign fresh attribute IDs,
// which removes the ambiguity with the parent DataFrame
var multiInsuCountDf = multiInsuDf.select("req_id", "main_flag", "name", "idn")
  .withColumnRenamed("req_id", "req_id")
  .withColumnRenamed("main_flag", "main_flag")
  .withColumnRenamed("name", "name")
  .withColumnRenamed("idn", "idn")
multiInsuCountDf = multiInsuCountDf.groupBy("req_id", "main_flag", "name", "idn").count()
val multiInsuDfResult = multiInsuDf.join(multiInsuCountDf, Seq("req_id", "main_flag", "name", "idn"), "left")
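As an alternative (a sketch only, not the fix used above): the same per-group count can be attached with a window function, which avoids the self-join and therefore the duplicate-attribute problem entirely. This assumes the goal is simply to append each group's row count to multiInsuDf.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

// Partition by the same key columns used in the groupBy above
val grpWindow = Window.partitionBy("req_id", "main_flag", "name", "idn")

// Every row receives the size of its group in a new "count" column,
// matching the result of the groupBy + left join without a second DataFrame
val multiInsuDfResult = multiInsuDf.withColumn("count", count(lit(1)).over(grpWindow))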
2. Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded
When using Spark, growing data volumes can trigger either of the following two errors:
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: GC overhead limit exceeded
The fix is to increase the driver's memory through configuration; it defaults to only 512m in older Spark versions (1g in more recent releases):
spark.driver.memory 2g
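This line can go into conf/spark-defaults.conf, or the value can be passed on the command line when submitting the job. Note that spark.driver.memory only takes effect before the driver JVM starts, so in client mode it must not be set through SparkConf inside the application itself. A sketch (the main class and jar name are placeholders):
spark-submit --driver-memory 2g --class com.example.Main my-app.jar  # class and jar are hypothetical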