In day-to-day work, the user column in the interaction matrices I receive is a string, so it has to be mapped to a unique long/int ID. I thought I had found the perfect function for this, monotonically_increasing_id: generate the IDs, then join them back onto the original DataFrame. The straightforward implementation:
import org.apache.spark.sql.functions.monotonically_increasing_id

val userdf = df.select("user").dropDuplicates().withColumn("userid", monotonically_increasing_id())
val newdf = df.join(userdf, "user")
However, when feeding the result into Spark's built-in ALS for training, the userid has to be converted to an Int, and that is where a huge pit was dug:
import org.apache.spark.mllib.recommendation.Rating

val ratingRdd = newdf.rdd.map(r =>
  Rating(r.getAs[Long]("userid").toInt,
    r.getAs[String]("productid").toInt,
    r.getAs[Long]("rating").toDouble)
).cache()
The reason: monotonically_increasing_id produces a Long, and within each partition it counts up without duplicates, starting from a different Long base value per partition. From the official documentation:
- A column expression that generates monotonically increasing 64-bit integers.
- The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
- The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
- As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
So once the data volume grows, converting the Long IDs to Int truncates the upper bits, which is why the number of distinct users in my ratingRdd came out far smaller than in the original df.
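The truncation is easy to demonstrate without Spark, using exactly the IDs the documentation above quotes for two partitions of 3 records (plain Scala, no cluster needed):

```scala
// IDs monotonically_increasing_id would assign to two partitions of 3 records:
val partition0Ids = Seq(0L, 1L, 2L)
val partition1Ids = Seq(1L << 33, (1L << 33) + 1, (1L << 33) + 2) // 8589934592, ...

// .toInt keeps only the lower 32 bits, so the partition ID in the upper
// bits is discarded and partition 1's IDs collapse onto partition 0's:
val truncated = (partition0Ids ++ partition1Ids).map(_.toInt)
// truncated == Seq(0, 1, 2, 0, 1, 2): six users shrink to three distinct IDs
```

Because the partition ID lives above bit 33, every record outside partition 0 loses its distinguishing bits under .toInt, silently merging users from different partitions.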
Solutions:
1. Drop monotonically_increasing_id and honestly use zipWithIndex followed by toDF.
2. Or repartition the deduplicated data into a single partition first. But this also fails when the user count is very large, because there is no way around the fact that a Long holds more bits than an Int can record: once the IDs exceed Int.MaxValue, the truncation returns.
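Solution 1 can be sketched as follows (a sketch, not a drop-in implementation: it assumes a SparkSession named spark and the original DataFrame df with a string "user" column). RDD.zipWithIndex assigns consecutive Long indices starting at 0, so the IDs stay Int-safe as long as the number of distinct users is below Int.MaxValue (about 2.1 billion):

```scala
import spark.implicits._ // needed for .toDF on an RDD of tuples

// Deduplicate users, then let zipWithIndex hand out consecutive Long indices
// 0, 1, 2, ... regardless of how the data is partitioned.
val userdf = df.select("user").dropDuplicates()
  .rdd.map(_.getString(0))
  .zipWithIndex()
  .toDF("user", "userid")

// Join the dense IDs back, exactly as before.
val newdf = df.join(userdf, "user")
```

Unlike monotonically_increasing_id, zipWithIndex triggers a Spark job to count records per partition, which costs an extra pass over the data; that is the price of consecutive IDs.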