Spark / Spark SQL: Why a Broadcast Join Does Not Take Effect

高达一号

Published 2023-09-06 10:24:39



Original link: https://it.cha138.com/javascript/show-99720.html


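Problem and Investigation

Big data jobs often join a large table with a small one. If the smaller table is small enough to be broadcast to every executor, a broadcast join (map-side join) can be used, which also avoids the data skew that a shuffle join may suffer from.

From reading the documentation, I remembered this parameter: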

spark.sql.autoBroadcastJoinThreshold 10485760 (10 MB) Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.

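Seeing "auto" in the name, I assumed Spark would switch to a broadcast join automatically based on table size. Today, however, a job whose small table was under the threshold still did not use a broadcast join, so I went back to the official documentation and found that the description says: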

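The property above sets the maximum size of a table that will be broadcast in a join, and setting it to -1 disables broadcast joins. The key point is the last sentence: table size statistics are currently only supported for Hive Metastore tables. In other words, automatic conversion can only be relied on for Hive tables whose size is available through ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan. If the small table is the result of DataFrame computations, Spark has no such statistics and has to estimate its size from the plan. That estimate can be far larger than the real size, so the join may not be automatically optimized into a broadcast join.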
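To see which size Spark actually compares against the threshold, the estimate can be read from the optimized logical plan. The following is a minimal sketch, assuming Spark 2.3 or later (where `stats` on a logical plan takes no arguments) and a local session. The object name `InspectPlanSize` and the DataFrame `smallDf` are made up for illustration and do not come from the original job.

```scala
import org.apache.spark.sql.SparkSession

object InspectPlanSize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-plan-size")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical small table derived from a transformation rather than read from Hive.
    val smallDf = spark.range(0, 1000)
      .filter($"id" % 10 === 0)
      .withColumnRenamed("id", "join_key")

    // Size estimate (in bytes) that the planner compares with the threshold.
    val estimatedSize: BigInt = smallDf.queryExecution.optimizedPlan.stats.sizeInBytes
    // Current threshold as a string; its format (plain bytes or e.g. "10MB")
    // depends on the Spark version, so it is only printed here, not parsed.
    val threshold: String = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

    println(s"estimated size in bytes: $estimatedSize")
    println(s"autoBroadcastJoinThreshold: $threshold")

    spark.stop()
  }
}
```

If the printed estimate is above the threshold even though the real data is tiny, that would explain why the automatic conversion does not happen.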

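Computing statistics for a partition of a Hive table:

ANALYZE TABLE app_user_order PARTITION(dt='2019-05-01') COMPUTE STATISTICS noscan;

Output:

Partition app.app_dm_online_log{dt=2019-05-01} stats: [numFiles=1, numRows=178, totalSize=308285, rawDataSize=308107]

From this we can infer that when Spark decides whether to use a broadcast join, it obtains the small table's size from metadata, and the source code supports this. The following three methods show how the size is computed: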

org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats#stats

org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor#default

org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils#getOutputSize:

def getOutputSize(
      attributes: Seq[Attribute],
      outputRowCount: BigInt,
      attrStats: AttributeMap[ColumnStat] = AttributeMap(Nil)): BigInt = {
    // We assign a generic overhead for a Row object, the actual overhead is different for different
    // Row format.
    val sizePerRow = 8 + attributes.map { attr =>
      if (attrStats.contains(attr)) {
        attr.dataType match {
          case StringType =>
            // UTF8String: base + offset + numBytes
            attrStats(attr).avgLen + 8 + 4
          case _ =>
            attrStats(attr).avgLen
        }
      } else {
        attr.dataType.defaultSize
      }
    }.sum

    // Output size can't be zero, or sizeInBytes of BinaryNode will also be zero
    // (simple computation of statistics returns product of children).
    if (outputRowCount > 0) outputRowCount * sizePerRow else 1
}

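The code above is how the table size is estimated: the row count is multiplied by a per-row size, which uses column statistics where they are available and the data type's default size otherwise.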
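As a worked example of that formula, here is a simplified, Spark-free sketch. The `Column` case class and `estimateOutputSize` are hypothetical stand-ins for `Attribute`, `ColumnStat`, and `getOutputSize`. They only mirror the arithmetic and are not Spark's API.

```scala
object SizeEstimateSketch {
  // Simplified stand-in for a column: a string or fixed-width column,
  // with an optional average length taken from column statistics.
  final case class Column(isString: Boolean, defaultSize: Int, avgLen: Option[Long])

  def estimateOutputSize(columns: Seq[Column], rowCount: BigInt): BigInt = {
    // 8 bytes of generic per-row overhead, as in getOutputSize.
    val sizePerRow: Long = 8L + columns.map { c =>
      c.avgLen match {
        case Some(len) if c.isString => len + 8 + 4 // UTF8String: base + offset + numBytes
        case Some(len)               => len
        case None                    => c.defaultSize.toLong
      }
    }.sum
    // Same guard as the original: the estimate is never zero.
    if (rowCount > 0) rowCount * sizePerRow else BigInt(1)
  }

  def main(args: Array[String]): Unit = {
    val columns = Seq(
      Column(isString = false, defaultSize = 8, avgLen = None),     // fixed-width key without stats
      Column(isString = true, defaultSize = 20, avgLen = Some(20L)) // string column with avgLen = 20
    )
    // Per row: 8 + 8 + (20 + 8 + 4) = 48 bytes; 178 rows => 8544 bytes.
    println(estimateOutputSize(columns, BigInt(178)))
  }
}
```

The point of the sketch is that the estimate depends only on the row count and the per-column sizes, so a missing or wrong row count directly distorts the value compared with the threshold.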

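Solution

Does this mean a small table produced by transformations can never be broadcast? No. org.apache.spark.sql.functions#broadcast can be used to force a broadcast join on the small table, provided the table is small enough to broadcast and the Driver and Executors have enough memory.

Finally, a closer look at the Hive command:

ANALYZE TABLE db.tableName PARTITION(dt='2019-05-01') COMPUTE STATISTICS noscan;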

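This command is very useful. In short, it collects statistics for a Hive table or partition, such as the number of rows, partitions, and files and the data size in bytes. The statistics can also be viewed with DESCRIBE EXTENDED tablename.
When are Hive statistics computed? Normally, Hive computes statistics automatically once a table has been created and stores them in the metastore. This can be turned off explicitly when creating the table:

hive.stats.autogather=false

With this setting, the table's statistics are no longer computed automatically.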

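One more thing to note: in df.join(broadcast(small_df), Seq("join_key"), "join type"), the side that carries broadcast(small_df) matters. In my case it had to be on the right side of the join, otherwise the broadcast join was not executed. This is expected for a left outer join, where only the right side can be broadcast; for an inner join either side can be broadcast.

References: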
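The explicit hint can be tried out with a small local job. This is a sketch under stated assumptions, not the original job: it assumes a local Spark session, and `largeDf`, `smallDf`, the column names, and the object name are invented for illustration. Automatic broadcasting is disabled first so that only the `broadcast` hint can trigger a broadcast join.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastHintExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-hint-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Disable automatic broadcasting so that only the explicit hint can trigger it.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Hypothetical large and small tables sharing the column "join_key".
    val largeDf = spark.range(0, 1000000).withColumnRenamed("id", "join_key")
    val smallDf = Seq((1L, "a"), (2L, "b")).toDF("join_key", "label")

    // Left outer join with the hinted small table on the right side.
    val joined = largeDf.join(broadcast(smallDf), Seq("join_key"), "left_outer")

    // Print the physical plan; look for a BroadcastHashJoin node.
    joined.explain()

    spark.stop()
  }
}
```

Checking the output of `explain()` is a quick way to confirm whether the hint took effect before running the job on real data.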

Performance Tuning - Spark 3.4.1 Documentation

Analysis of broadcast join in Spark SQL - dabokele's blog - CSDN

Reference for the Hive ANALYZE command:

Column Statistics in Hive - Apache Hive - Apache Software Foundation

Hive ANALYZE TABLE Command - Table Statistics - DWgeek.com
