Run the following code in the spark shell:
val max_array = max_read_fav_share_vote.collect
val max_read = max_array(0)(0).toString.toDouble
val max_fav = max_array(0)(1).toString.toDouble
val max_share = max_array(0)(2).toString.toDouble
val max_vote = max_array(0)(3).toString.toDouble
val id_hot = serviceid_read_fav_share_vote.map { x =>
  val id = x.getString(0)
  val read = x.getLong(1).toDouble
  val fav = x.getLong(2).toDouble
  val share = x.getLong(3).toDouble
  val vote = x.getLong(4).toDouble
  val hot = 0.1 * (read / max_read) + 0.2 * (fav / max_fav) +
    0.3 * (share / max_share) + 0.4 * (vote / max_vote)
  (id, hot)
}.toDF("id", "hot")
This fails with a task-serialization error.
The cause is that the map (or filter) closure references variables defined outside it. Spark executes a job by shipping tasks to the nodes where the data is distributed, so before a task is sent, the closure and every object it references must be serialized — and some objects are not serializable.
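As a minimal, Spark-free sketch of the underlying mechanism (the class name `Config` is hypothetical), pushing a non-`Serializable` object through Java serialization — which is essentially what Spark does to a task closure before shipping it — throws a `NotSerializableException`:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical class that does NOT extend java.io.Serializable,
// standing in for a non-serializable variable captured by a closure.
class Config(val maxRead: Double)

val out = new ObjectOutputStream(new ByteArrayOutputStream)
val failed =
  try { out.writeObject(new Config(100.0)); false }
  catch { case _: NotSerializableException => true }
// failed is true: the object cannot leave the JVM it was created in
```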
The fix: do not ship the non-serializable variables with the closure at all; let each node obtain and use them locally instead, by replacing map with mapPartitions (or a similar partition-level operator). The code becomes:
val id_hot = serviceid_read_fav_share_vote.mapPartitions { partition =>
  partition.map { x =>
    val id = x.getString(0)
    val read = x.getLong(1).toDouble
    val fav = x.getLong(2).toDouble
    val share = x.getLong(3).toDouble
    val vote = x.getLong(4).toDouble
    val hot = 0.1 * (read / max_read) + 0.2 * (fav / max_fav) +
      0.3 * (share / max_share) + 0.4 * (vote / max_vote)
    (id, hot)
  }
}.toDF("id", "hot")
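A related workaround, commonly recommended for Spark closures but not part of the original text, is to copy just the primitive values out of a non-serializable object into local vals before the closure, so the closure captures only serializable data. A Spark-free sketch (the names `Stats`, `makeNormalizer`, and `localMax` are hypothetical):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical non-serializable holder of the maximum values.
class Stats { val maxRead: Double = 100.0 }

def makeNormalizer(stats: Stats): Double => Double = {
  // Copy the needed value into a local val: the closure below then
  // captures only this Double, not the non-serializable Stats instance.
  val localMax = stats.maxRead
  read => read / localMax
}

val normalize = makeNormalizer(new Stats)
// Serializing the closure now succeeds, because it carries no
// reference to the Stats object.
new ObjectOutputStream(new ByteArrayOutputStream).writeObject(normalize)
```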