Everyday Spark tuning tips
1. When using Spark to read a MySQL table and write it into the data warehouse, push the filter conditions (or the whole SQL statement) down into the JDBC read itself. If you instead load the entire MySQL table into memory and then register a temporary table, a large source table will consume far more memory than necessary. Reading only the result rows up front and writing them straight into the warehouse avoids this.
// Wrap the query in parentheses with an alias so spark.read.jdbc can treat it as a table
val callmysql = s"""(select '会议' as module, '正常通话率' as metrics_type, 'ads_voip_sample' as original_table, avg(metrics) as metrics, isWorkDay, isBaidu as corpId, dat as data_dat from ads_voip_sample where orderParentTabFlag = '4' and callType = 'All' and deviceType = 'All' and metrics <= 1 and dat >= '${args(0)}' and dat <= '${args(1)}' group by isWorkDay, isBaidu, dat) as tmp"""
val callLog = spark.read.jdbc(url, callmysql, prop)
callLog.write.insertInto("table_name")
2. To cut down on small files, attack the problem at the source: reduce the number of partitions before the data is written out, so each write produces fewer, larger files.
res.repartition(1).write.mode(SaveMode.Append).jdbc(url,"table_name",prop)
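One caveat with `repartition(1)`: it forces a full shuffle of the data onto a single task. When you are only shrinking the partition count, `coalesce` can achieve the same effect without a shuffle. A minimal sketch, assuming `res`, `url`, and `prop` are defined as in the example above:

```scala
import org.apache.spark.sql.SaveMode

// coalesce(n) narrows the partition count without a full shuffle,
// so it is usually cheaper than repartition(n) when only reducing partitions
res.coalesce(1)
  .write
  .mode(SaveMode.Append)
  .jdbc(url, "table_name", prop)
```

Note the trade-off: `repartition(1)` rebalances the data evenly via a shuffle, while `coalesce(1)` funnels all upstream partitions through one task. For very large results, a modest value such as `coalesce(8)` is often a safer middle ground between file count and write parallelism.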