In day-to-day work, the user column in the interaction matrices I receive is a string, so it has to be mapped to a unique long/int ID. I thought I had found the perfect function for this, monotonically_increasing_id: generate the IDs, then join them back onto the original DataFrame. The straightforward implementation:
import org.apache.spark.sql.functions.monotonically_increasing_id

val userdf = df.select("user").dropDuplicates().withColumn("userid", monotonically_increasing_id())
val newdf = df.join(userdf, "user")
However, when feeding the result into Spark's built-in ALS for training, the userid has to be converted to an Int, and that is where a huge pit was dug:
import org.apache.spark.mllib.recommendation.Rating

val ratingRdd = newdf.rdd.map(r =>
  Rating(r.getAs[Long]("userid").toInt,
    r.getAs[String]("productid").toInt,
    r.getAs[Long]("rating").toDouble)
).cache()
The reason: monotonically_increasing_id produces a Long, and within each partition it counts up without duplicates, starting from a different Long base value per partition. From the official documentation:
- A column expression that generates monotonically increasing 64-bit integers.
- The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
- The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
- As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
So once the data volume grows, converting the Long IDs to Int truncates the upper bits, which is why the number of distinct users in my ratingRdd came out far smaller than in the original df.
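The truncation is easy to demonstrate without Spark, using exactly the IDs the documentation above quotes for two partitions of 3 records (plain Scala, no cluster needed):

```scala
// IDs monotonically_increasing_id would assign to two partitions of 3 records:
val partition0Ids = Seq(0L, 1L, 2L)
val partition1Ids = Seq(1L << 33, (1L << 33) + 1, (1L << 33) + 2) // 8589934592, ...

// .toInt keeps only the lower 32 bits, so the partition ID in the upper
// bits is discarded and partition 1's IDs collapse onto partition 0's:
val truncated = (partition0Ids ++ partition1Ids).map(_.toInt)
// truncated == Seq(0, 1, 2, 0, 1, 2): six users shrink to three distinct IDs
```

Because the partition ID lives above bit 33, every record outside partition 0 loses its distinguishing bits under .toInt, silently merging users from different partitions.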
Solutions:
1. Drop monotonically_increasing_id and honestly use zipWithIndex followed by toDF.
2. Or repartition the deduplicated data into a single partition first. But this also fails when the user count is very large, because there is no way around the fact that a Long holds more bits than an Int can record: once the IDs exceed Int.MaxValue, the truncation returns.
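Solution 1 can be sketched as follows (a sketch, not a drop-in implementation: it assumes a SparkSession named spark and the original DataFrame df with a string "user" column). RDD.zipWithIndex assigns consecutive Long indices starting at 0, so the IDs stay Int-safe as long as the number of distinct users is below Int.MaxValue (about 2.1 billion):

```scala
import spark.implicits._ // needed for .toDF on an RDD of tuples

// Deduplicate users, then let zipWithIndex hand out consecutive Long indices
// 0, 1, 2, ... regardless of how the data is partitioned.
val userdf = df.select("user").dropDuplicates()
  .rdd.map(_.getString(0))
  .zipWithIndex()
  .toDF("user", "userid")

// Join the dense IDs back, exactly as before.
val newdf = df.join(userdf, "user")
```

Unlike monotonically_increasing_id, zipWithIndex triggers a Spark job to count records per partition, which costs an extra pass over the data; that is the price of consecutive IDs.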