Spark: Writing to Hive Tables Invalidates DataFrame Caches

    While writing a Spark job recently I ran into a strange problem. I had cached two DataFrames, let's call them A and B, and then wrote A and B into two Hive tables with INSERT OVERWRITE SQL statements. I noticed that if I wrote A first and then B, DataFrame B was recomputed from scratch instead of being read from its cache... For a while I was convinced I had hit some unknown Spark bug, but on a whim I swapped the order of the two writes, writing B first and then A, and both caches worked again.
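The job had roughly the shape below (a sketch only: the SparkSession setup, table names and columns are made up here, and the post does not show exactly how the cached DataFrames relate to the tables being written):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-invalidation-demo")
  .enableHiveSupport()
  .getOrCreate()

// Two cached DataFrames; B is derived from A.
val dfA = spark.table("src_db.events").filter("dt = '2020-01-01'").cache()
val dfB = dfA.groupBy("user_id").count().cache()
dfA.count(); dfB.count()                  // materialize both caches

dfA.createOrReplaceTempView("tmp_a")
dfB.createOrReplaceTempView("tmp_b")

// Writing A first and then B: the second INSERT triggered a full recomputation
// of B instead of hitting its cache; with the order swapped, both caches held.
spark.sql("INSERT OVERWRITE TABLE out_db.table_a SELECT * FROM tmp_a")
spark.sql("INSERT OVERWRITE TABLE out_db.table_b SELECT * FROM tmp_b")
```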

 

After quite a bit of searching I finally found the following description in Spark's JIRA (Reference 1):

When invalidating a cache, we invalid other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.

However, in other cases, like when user simply want to drop a cache to free up memory, we do not need to invalidate dependent caches since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: the non-cascading cache invalidation. And we choose between the existing mode and the new mode for different cache invalidation scenarios:

  1. Drop tables and regular (persistent) views: regular mode
  2. Drop temporary views: non-cascading mode
  3. Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
  4. Call DataSet.unpersist(): non-cascading mode
  5. Call Catalog.uncacheTable(): follow the same convention as drop tables/view, which is, use non-cascading mode for temporary views and regular mode for the rest

Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not cached), any query referring to that view should no long be valid. Hence if a cached persistent view is dropped, we need to invalidate the all dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees a consistent uncaching behavior between temporary views and unnamed DataSets.

   The gist is: if DataFrame_first depends on DataFrame_second, and DataFrame_second changes (for example, the underlying table it reads from is modified or dropped), then DataFrame_first's cache is invalidated as well. This cascading invalidation exists to guarantee that cached data is never stale once the data it was derived from has changed.
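Item 3 in the list above (modifying table contents uses regular, i.e. cascading, mode) is exactly the case that matters here. A small illustration, with made-up table and column names:

```scala
// Item 3: modifying a table's contents invalidates every cache that uses the table.
spark.catalog.cacheTable("db.users")          // cache the table itself
val activeUsers = spark.table("db.users")
  .filter("active = true")
  .cache()                                    // a second cache built on top of the table
activeUsers.count()                           // materialize it

// The INSERT modifies the table's contents, so regular (cascading) invalidation
// kicks in: both the table cache and the dependent activeUsers cache are invalidated.
spark.sql("INSERT INTO db.users VALUES (42, 'new_user', true)")
```

By contrast, simply calling activeUsers.unpersist() (item 4) would drop only that one cache entry and leave everything else alone.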

The dependency between the DataFrames in my job was as follows:

(hand-drawn dependency diagram: DataFrameC is derived from DataFrameB; hahaha, behold my inner artist)

In my job I wrote DataFrameB first, and that invalidated DataFrameC's cache. The code that invalidates the cache lives in Spark's Hive write path itself, as part of the INSERT command:
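In user-level API terms, the write effectively does something like the following to the cache (a rough behavioral sketch rather than the actual internal source; the table name is made up):

```scala
// Behavioral sketch of what an INSERT OVERWRITE into a Hive table does to the cache,
// expressed with the public Catalog API (not the real internal code path).
spark.catalog.uncacheTable("out_db.table_b")  // persistent table => regular (cascading) mode:
                                              // caches that depend on this table are dropped too
spark.catalog.refreshTable("out_db.table_b")  // pick up the newly written data
```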

PS: The cascading-invalidation rules changed in Spark 2.4: since 2.4, invalidating a cache no longer always cascades to dependent caches (for example, Dataset.unpersist() and dropping a temporary view are now non-cascading). But my case writes to a Hive table, i.e., it modifies table contents, so the dependent DataFrame caches are always invalidated! (Reference 2)

 

 

Solutions:

    1. The simplest fix: swap the write order, i.e. write DataFrameC first and DataFrameB second.

    2. But what if both DataFrames still need to be used later on? Push the cache down from the DataFrame to its underlying RDD, then turn that RDD back into a DataFrame for the rest of the job; see the sketch after this list.
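A minimal sketch of option 2 (the helper name and storage level are my own choices): caching at the RDD level keeps the data outside the SQL cache manager, so a later INSERT OVERWRITE cannot cascade-invalidate it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

// Cache a DataFrame's rows as an RDD (outside the SQL CacheManager),
// then rebuild a DataFrame on top of the cached RDD for further use.
def cacheViaRdd(spark: SparkSession, df: DataFrame): DataFrame = {
  val schema = df.schema
  val rdd = df.rdd.persist(StorageLevel.MEMORY_AND_DISK)
  rdd.count()                             // materialize the RDD cache eagerly
  spark.createDataFrame(rdd, schema)      // same schema, now backed by the cached RDD
}

// e.g. val dfC = cacheViaRdd(spark, rawDfC)   // rawDfC is whatever produced DataFrameC
```

The trade-off is that an RDD cache stores plain Row objects instead of the compressed columnar format used by DataFrame caching, so it usually needs more memory.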

 

 

References:

    1. https://issues.apache.org/jira/browse/SPARK-24596 (JIRA introducing non-cascading cache invalidation)
    2. http://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html (since Spark 2.4, cache invalidation no longer always cascades)
