Notes on a bug in Spark 2.3.x

Background

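While refactoring our Spark code, some of the phase-two account computations were written with Spark SQL. They turned out to need several left joins in a row, so for efficiency I decided to repartition the data first: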

    import org.apache.spark.sql.functions.{col, lit}
    import org.apache.spark.sql.types.DoubleType

    // getIdOrPlatform, getPriceFromPriceInfo, platformFilter and seqNum
    // are defined elsewhere in the project.
    val price =
      s"""
         |select
         |identify_id,weibo_type,price_info
         |from dm_account.hogwarts_account
         |where identify_id is not null AND weibo_type is not null AND price_info is not null
         |""".stripMargin
    val priceTable = "industry_" + seqNum
    val priceDF = sparkSession.sql(price)
      .withColumn("platform_type", getIdOrPlatform(col("weibo_type"), lit("platform")))
      .withColumn("price", getPriceFromPriceInfo(col("price_info")).cast(DoubleType))
      .filter(s"price is not null AND price > 0 AND platform_type is not null AND ${platformFilter}")
      .select("identify_id", "platform_type", "price")
      .repartition(20, col("identify_id")) // hits the bug on 2.3.x
    // createTempView returns Unit, so register the view in a separate statement.
    priceDF.createTempView(priceTable)

Note the repartition call. Judging by its overloads, the following form should be available:
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]

But it simply would not run, and failed with: At least one partition-by expression must be specified
Looking at the source:

2.3.x

/**
   * Returns a new Dataset partitioned by the given partitioning expressions into
   * `numPartitions`. The resulting Dataset is range partitioned.
   *
   * At least one partition-by expression must be specified.
   * When no explicit sort order is specified, "ascending nulls first" is assumed.
   * Note, the rows are not sorted in each partition of the resulting Dataset.
   *
   * @group typedrel
   * @since 2.3.0
   */
  @scala.annotation.varargs
  def repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T] = {
    require(partitionExprs.nonEmpty, "At least one partition-by expression must be specified.")
    val sortOrder: Seq[SortOrder] = partitionExprs.map(_.expr match {
      case expr: SortOrder => expr
      case expr: Expression => SortOrder(expr, Ascending)
    })
    withTypedPlan {
      RepartitionByExpression(sortOrder, planWithBarrier, numPartitions)
    }
  }

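It turns out that 2.3.x introduced a new underlying implementation: the message comes from the require check in repartitionByRange, which is marked @since 2.3.0.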
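To see how that guard produces the message, here is a minimal sketch in plain Scala with no Spark involved. The object RequireSketch and the function repartitionByRangeModel are hypothetical stand-ins that copy only the require line from the source above; column names are plain strings rather than Column values.

```scala
object RequireSketch {
  // Hypothetical stand-in for Dataset.repartitionByRange: same guard, no Spark.
  // partitionExprs is a varargs parameter, so it may legally be empty.
  def repartitionByRangeModel(numPartitions: Int, partitionExprs: String*): String = {
    require(partitionExprs.nonEmpty, "At least one partition-by expression must be specified.")
    s"range-partition into $numPartitions by ${partitionExprs.mkString(", ")}"
  }

  def main(args: Array[String]): Unit = {
    // With at least one expression the guard passes.
    println(repartitionByRangeModel(42, "identify_id", "platform_type"))

    // With no expressions, require throws IllegalArgumentException and
    // prefixes the message with "requirement failed: ".
    try {
      repartitionByRangeModel(42)
    } catch {
      case e: IllegalArgumentException => println(e.getMessage)
    }
  }
}
```

The guard fires only when the varargs list is empty, so in this model the message means that no partitioning expression reached the method.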

Resolution

  1. Change the Spark version in the pom to 2.2.2, which resolves the problem.

  2. Replace .repartition(20, col("identify_id")) with .repartitionByRange(42, col("identify_id"), col("platform_type")) // 2.3.x only
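The two calls from the list above can be tried side by side in a local session. This is a minimal sketch, assuming Spark 2.3 or later is on the classpath (repartitionByRange is marked @since 2.3.0). The object name RepartitionSketch, the helper names, and the sample rows are made up for illustration and are not taken from the original job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RepartitionSketch {
  // Tiny stand-in for the price table; the rows are invented for illustration.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("id1", "weibo", 10.0), ("id2", "wechat", 20.0), ("id3", "weibo", 30.0))
      .toDF("identify_id", "platform_type", "price")
  }

  // Hash partitioning on one column, as in the original job.
  def hashPartitioned(df: DataFrame): DataFrame =
    df.repartition(20, col("identify_id"))

  // Range partitioning on two columns, as in option 2 (Spark 2.3+ only).
  def rangePartitioned(df: DataFrame): DataFrame =
    df.repartitionByRange(42, col("identify_id"), col("platform_type"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-sketch")
      .getOrCreate()
    try {
      val df = sampleData(spark)
      println(hashPartitioned(df).rdd.getNumPartitions)
      println(rangePartitioned(df).rdd.getNumPartitions)
    } finally {
      spark.stop()
    }
  }
}
```

Note that the two calls are not equivalent: the first hash-partitions by identify_id, while the second range-partitions, and according to the doc comment quoted earlier the rows are not sorted within each resulting partition.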
