Spark SQL Performance Tuning Problems

Latest recommended article published 2024-01-30 19:31:19

soaring0121

Categories: Big Data, SPARK. Tags: big data, hive, Cartesian product

Article link: https://blog.csdn.net/soaring0121/article/details/108449416

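My workload runs a CPU-intensive computation over a large dataset (on the order of millions of rows); one full run takes about 8 hours. The results are lightly post-processed and written to both Hive and ES. While using Spark SQL for this, I ran into two performance problems:

1. Because a single computation is so expensive, I called dataframe.cache() and then wrote the result to Hive and ES separately. The computation nevertheless ran twice; the cache did not take effect as expected.

2. Because the full computation is so expensive, I tried to compute only on the incremental data, based on how the business data behaves. I expressed this with case when in SQL, and with when/otherwise from pyspark.sql.functions, but in practice everything was still computed in full.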

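# CPU-bound computation logic; compute() is the business function, defined elsewhere
def cpu_bound_compute(content):
    return compute(content)
hive_context.registerFunction("cpu_bound_compute", cpu_bound_compute)

sql = """
    select uid, content, cpu_bound_compute(content) computed, time
    from source_table
    where date={date}
""".format(date=current_date)

# step1. compute and cache the result (cache() is lazy; nothing runs until an action)
data_frame = hive_context.sql(sql).cache()

# step2. create a view and write to hive
data_frame.createTempView("view_table")
insert_sql = """
    insert overwrite table sink_table partition(date='{date}')
    select *
    from view_table
""".format(date=current_date)
hive_context.sql(insert_sql)

# step3. write to es (es_resource, the target index, is not given in the original)
data_frame.write.format("org.elasticsearch.spark.sql").save(es_resource)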

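The code above is a simplified sample that reproduces the performance problem. The runtime DAG and the Stages table show clearly that both step2 and step3 re-ran the step1 computation, so the job took 16 hours in total.

Investigating problem 1: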

# step1. compute and cache the result
data_frame = hive_context.sql(sql).cache()

# step2. trigger the computation
data_frame.count()

# step3. test whether the cache is still used
data_frame.createTempView("view_table")
insert_sql = """
    insert overwrite table sink_table partition(date='{date}')
    select *
    from view_table
""".format(date=current_date)
hive_context.sql(insert_sql)

# step4. test whether the cache is still used
data_frame.count()

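To investigate, I changed the program as shown above and ran it. The DAG showed that in step3 the cache was used and the step1 computation was skipped; in step4, however, the step1 computation ran again, which means the cache was no longer in effect at that point. From this I infer that running createTempView() causes the cache to be invalidated. Operations on the DataFrame itself should therefore come before the view operations to avoid losing the cache.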
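Besides reading the DAG in the Spark UI, the physical plan can be inspected from code. The following is a minimal sketch, not part of the original program: the helper names plan_text and reads_from_cache are hypothetical, and it assumes that a plan served from the cache contains an InMemoryTableScan node. It only tells whether Spark plans to read from the cache; whether the cached blocks are actually materialized is still best confirmed on the Storage tab.

```python
import contextlib
import io


def plan_text(df):
    """Capture what df.explain() prints (PySpark prints the plan from Python)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def reads_from_cache(df):
    """Return True if the physical plan scans cached data instead of recomputing it."""
    return "InMemoryTableScan" in plan_text(df)
```

Calling reads_from_cache(data_frame) before and after each step of the test program would show at which step the plan stops referring to the cached data.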

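Summary: perform every operation on the cached DataFrame before the view operations to avoid invalidating the cache.

Investigating problem 2:

With problem 1 fixed, the run time dropped to 8 hours, which was still too long. Our business data is a full snapshot every day, and each day adds only about 100,000 rows compared with the previous day, so roughly 90% of the computation is repeated. Making the computation incremental should in theory cut the time by about 90%. The logic was therefore changed as follows: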
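The ordering rule can be sketched as a small helper. This is an illustration under stated assumptions rather than the original job: the function name write_outputs, the es_resource argument and the append write mode are hypothetical, and the Elasticsearch format name assumes the elasticsearch-hadoop connector is on the classpath.

```python
def write_outputs(spark, df, es_resource, insert_sql, view_name="view_table"):
    """Cache once, run the direct DataFrame writes first, and the view-based insert last."""
    cached = df.cache()
    try:
        # An action forces the expensive computation once and fills the cache.
        cached.count()
        # Direct DataFrame operation first (ES write; append mode is an assumption).
        cached.write.format("org.elasticsearch.spark.sql").mode("append").save(es_resource)
        # View operations last, following the conclusion of the investigation.
        cached.createOrReplaceTempView(view_name)
        spark.sql(insert_sql)
    finally:
        # Release the cached blocks once both sinks have been written.
        cached.unpersist()
```

With the article's variables this would be called as write_outputs(hive_context, hive_context.sql(sql), es_resource, insert_sql).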

sql = """
    select a.uuid, content,
        case when computed is null then cpu_bound_compute(content)
            else computed
            end as computed,
        time
    from (
        select uuid, content, time
        from source_table
        where date={date}
    ) a left join (
        select uuid, computed
        from sink_table
        where date={one_date_ago}
    ) b on a.uuid = b.uuid
""".format(date=current_date, one_date_ago=one_date_ago)

## The incremental logic does not have to live in the SQL; the when/otherwise functions
## in pyspark.sql.functions can be used instead. The logic and the end result are the
## same as with case when, so it is not repeated here.

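After running the logic above, the run time did not shrink as expected. The DAG was almost identical to the one from the fix for problem 1, and the total run time was about the same. I do not know the detailed cause, but my guess is that this kind of conditional incremental logic still processes every row. I therefore switched to handling the two kinds of rows separately, which worked very well: the run time dropped to half an hour. The optimized code is as follows: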

sql = """
with base as (
    select a.uuid, content, computed, time
    from (
        select uuid, content, time
        from source_table
        where date={date}
    ) a left join (
        select uuid, computed
        from sink_table
        where date={one_date_ago}
    ) b on a.uuid = b.uuid
)
-- rows already computed yesterday: reuse the stored result
select uuid, content, computed, time
from base
where computed is not null
union all
-- new rows only: run the expensive computation
select uuid, content, cpu_bound_compute(content) computed, time
from base
where computed is null
""".format(date=current_date, one_date_ago=one_date_ago)
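The PySpark documentation for udf notes that user-defined functions do not support conditional expressions or short-circuiting and may end up being executed for every row, which is consistent with the behavior observed with case when. The workaround suggested there is to move the condition into the function itself. The sketch below illustrates that alternative to the union all split; it is not what the original job used. The name compute_if_missing is hypothetical, and compute here is only a stand-in for the real CPU-bound function.

```python
def compute(content):
    """Stand-in for the expensive CPU-bound computation (hypothetical)."""
    return content.upper()


def compute_if_missing(content, previous):
    """Reuse the previous day's result when present; otherwise run the expensive step."""
    if previous is not None:
        return previous
    return compute(content)


# Registration and use, following the article's code (not executed here):
#   hive_context.registerFunction("compute_if_missing", compute_if_missing)
#   select a.uuid, content, compute_if_missing(content, b.computed) as computed, time ...
```

Every row is still passed to the Python worker with this approach, but the expensive call is skipped for rows that already have a result. Whether it matches the half-hour run time of the split query would have to be measured.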

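To sum up, this article dealt with two hidden performance pitfalls in Spark SQL:

1. A DataFrame cache in Spark SQL is invalidated after the createTempView("view") view operation is executed.

2. Conditional functions such as case when or when/otherwise cannot separate rows for computation, so they do not deliver the intended performance gain.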