Spark: Writing to Hive Tables Invalidates DataFrame Caches

    While writing a Spark job recently I ran into a strange problem. I had cached two DataFrames, let's call them A and B, and then wrote A and B into two Hive tables with INSERT OVERWRITE SQL statements. I noticed that if I wrote A first and then B, DataFrame B was recomputed from scratch instead of being read from its cache... For a while I was convinced I had hit some unknown Spark bug, but on a whim I swapped the order of the two writes, writing B first and then A, and both caches worked again.
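The job had roughly the shape below (a sketch only: the SparkSession setup, table names and columns are made up here, and the post does not show exactly how the cached DataFrames relate to the tables being written):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-invalidation-demo")
  .enableHiveSupport()
  .getOrCreate()

// Two cached DataFrames; B is derived from A.
val dfA = spark.table("src_db.events").filter("dt = '2020-01-01'").cache()
val dfB = dfA.groupBy("user_id").count().cache()
dfA.count(); dfB.count()                  // materialize both caches

dfA.createOrReplaceTempView("tmp_a")
dfB.createOrReplaceTempView("tmp_b")

// Writing A first and then B: the second INSERT triggered a full recomputation
// of B instead of hitting its cache; with the order swapped, both caches held.
spark.sql("INSERT OVERWRITE TABLE out_db.table_a SELECT * FROM tmp_a")
spark.sql("INSERT OVERWRITE TABLE out_db.table_b SELECT * FROM tmp_b")
```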

 

After quite a bit of searching I finally found the following description in Spark's JIRA (Reference 1):

When invalidating a cache, we invalid other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.

However, in other cases, like when user simply want to drop a cache to free up memory, we do not need to invalidate dependent caches since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: the non-cascading cache invalidation. And we choose between the existing mode and the new mode for different cache invalidation scenarios:

  1. Drop tables and regular (persistent) views: regular mode
  2. Drop temporary views: non-cascading mode
  3. Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
  4. Call DataSet.unpersist(): non-cascading mode
  5. Call Catalog.uncacheTable(): follow the same convention as drop tables/view, which is, use non-cascading mode for temporary views and regular mode for the rest

Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not cached), any query referring to that view should no long be valid. Hence if a cached persistent view is dropped, we need to invalidate the all dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees a consistent uncaching behavior between temporary views and unnamed DataSets.

   The gist is: if DataFrame_first depends on DataFrame_second, and DataFrame_second changes (for example, the underlying table it reads from is modified or dropped), then DataFrame_first's cache is invalidated as well. This cascading invalidation exists to guarantee that cached data is never stale once the data it was derived from has changed.
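Item 3 in the list above (modifying table contents uses regular, i.e. cascading, mode) is exactly the case that matters here. A small illustration, with made-up table and column names:

```scala
// Item 3: modifying a table's contents invalidates every cache that uses the table.
spark.catalog.cacheTable("db.users")          // cache the table itself
val activeUsers = spark.table("db.users")
  .filter("active = true")
  .cache()                                    // a second cache built on top of the table
activeUsers.count()                           // materialize it

// The INSERT modifies the table's contents, so regular (cascading) invalidation
// kicks in: both the table cache and the dependent activeUsers cache are invalidated.
spark.sql("INSERT INTO db.users VALUES (42, 'new_user', true)")
```

By contrast, simply calling activeUsers.unpersist() (item 4) would drop only that one cache entry and leave everything else alone.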

The dependency between the DataFrames in my job was as follows:

(hand-drawn dependency diagram: DataFrameC is derived from DataFrameB; hahaha, behold my inner artist)

In my job I wrote DataFrameB first, and that invalidated DataFrameC's cache. The code that invalidates the cache lives in Spark's Hive write path itself, as part of the INSERT command:
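In user-level API terms, the write effectively does something like the following to the cache (a rough behavioral sketch rather than the actual internal source; the table name is made up):

```scala
// Behavioral sketch of what an INSERT OVERWRITE into a Hive table does to the cache,
// expressed with the public Catalog API (not the real internal code path).
spark.catalog.uncacheTable("out_db.table_b")  // persistent table => regular (cascading) mode:
                                              // caches that depend on this table are dropped too
spark.catalog.refreshTable("out_db.table_b")  // pick up the newly written data
```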

PS: The cascading-invalidation rules changed in Spark 2.4: since 2.4, invalidating a cache no longer always cascades to dependent caches (for example, Dataset.unpersist() and dropping a temporary view are now non-cascading). But my case writes to a Hive table, i.e., it modifies table contents, so the dependent DataFrame caches are always invalidated! (Reference 2)

 

 

Solutions:

    1. The simplest fix: swap the write order, i.e. write DataFrameC first and DataFrameB second.

    2. But what if both DataFrames still need to be used later on? Push the cache down from the DataFrame to its underlying RDD, then turn that RDD back into a DataFrame for the rest of the job; see the sketch after this list.
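A minimal sketch of option 2 (the helper name and storage level are my own choices): caching at the RDD level keeps the data outside the SQL cache manager, so a later INSERT OVERWRITE cannot cascade-invalidate it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

// Cache a DataFrame's rows as an RDD (outside the SQL CacheManager),
// then rebuild a DataFrame on top of the cached RDD for further use.
def cacheViaRdd(spark: SparkSession, df: DataFrame): DataFrame = {
  val schema = df.schema
  val rdd = df.rdd.persist(StorageLevel.MEMORY_AND_DISK)
  rdd.count()                             // materialize the RDD cache eagerly
  spark.createDataFrame(rdd, schema)      // same schema, now backed by the cached RDD
}

// e.g. val dfC = cacheViaRdd(spark, rawDfC)   // rawDfC is whatever produced DataFrameC
```

The trade-off is that an RDD cache stores plain Row objects instead of the compressed columnar format used by DataFrame caching, so it usually needs more memory.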

 

 

References:

    1. https://issues.apache.org/jira/browse/SPARK-24596 (JIRA introducing non-cascading cache invalidation)
    2. http://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html (since Spark 2.4, cache invalidation no longer always cascades)
