The pitfall of monotonically_increasing_id in Spark

In day-to-day work the user field in the interaction matrices I receive is a string, so it has to be mapped to a unique long or int id. I thought I had found a perfect function for this, monotonically_increasing_id: generate the ids and join them back onto the original data, which can be written directly as:

import org.apache.spark.sql.functions.monotonically_increasing_id 

val userdf = df.select("user").dropDuplicates().withColumn("userid", monotonically_increasing_id())
val newdf = df.join(userdf, "user")

However, when this data is fed into Spark's built-in ALS for training, the userid has to be converted to Int, and that is where a huge trap is buried.

import org.apache.spark.mllib.recommendation.Rating

// Build the Rating RDD expected by the RDD-based ALS; user and product ids must be Int
val ratingRdd = newdf.rdd.map(r =>
            Rating(r.getAs[Long]("userid").toInt,     // generated Long id, truncated to Int here
            r.getAs[String]("productid").toInt,
            r.getAs[Long]("rating").toDouble)
            ).cache()

The reason is that monotonically_increasing_id returns a Long, and within each partition the ids increase without duplicates starting from a different Long offset per partition. Here is the original documentation:

A column expression that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records. As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
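Given that layout, the partition id and the in-partition record number can be read back out of a generated id with simple bit arithmetic. This is only an illustrative sketch based on the 31/33-bit split described above:

// Third record of the second partition, taken from the example above
val id = 8589934594L

val partitionId  = (id >> 33).toInt          // 1
val recordNumber = id & ((1L << 33) - 1)     // 2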

Once the data gets large, the Long ids are truncated when cast to Int, so the number of distinct users in ratingRdd ends up much smaller than the number of users in the original df.
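A minimal sketch of the truncation: any id from the second partition onward is at least 1L << 33, which does not fit in an Int, so .toInt keeps only the lower 32 bits and ids from different partitions collapse onto the same value:

// First record of partitions 0, 1 and 2
val ids = Seq(0L, 1L << 33, 2L << 33)    // 0, 8589934592, 17179869184

// Long-to-Int conversion keeps only the lower 32 bits
val truncated = ids.map(_.toInt)         // Seq(0, 0, 0)

println(truncated.distinct.size)         // 1 instead of 3 distinct users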

Workarounds:
1. Stop using monotonically_increasing_id and honestly use zipWithIndex on the RDD instead, then convert back with toDF (see the sketch after this list).
2. Or repartition the deduplicated data into a single partition first. This also breaks down when the user count is extremely large, because it cannot avoid the case where the Long id needs more bits than an Int can hold.
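A minimal sketch of the zipWithIndex approach, reusing the user/userid column names from above (the surrounding spark session and the exact schema are assumptions):

import spark.implicits._

// Give every distinct user a consecutive 0-based index, then join it back
val userdf = df.select("user").dropDuplicates().rdd
  .map(_.getString(0))
  .zipWithIndex()                        // RDD[(String, Long)]: (user, 0), (user, 1), ...
  .toDF("user", "userid")

val newdf = df.join(userdf, "user")

The index is still a Long, but it is consecutive from 0, so the later .toInt is safe as long as the number of distinct users stays below Int.MaxValue.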
