Avoid using sparkSQL in Spark 2.x: how a CTAS bug was discovered

子安

Category: Spark. Tags: spark, bug, SparkSQL, CTAS

Article link: https://blog.csdn.net/bon_mot/article/details/75256525

Tags (space-separated): Spark2.x sparkSQL CTAS

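Avoid using sparkSQL in Spark 2.x: how a CTAS bug was discovered
Background
Discovering the problem
- 1 Finding the problem
- 2 Reproducing the problem
Attempted fixes
Solution
Conclusion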

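1. Background

CTAS is short for create table as select.

I have recently been using SparkSQL for quick ad hoc SQL analysis. Because the results of the analysis have to be saved, CTAS is a must. While using it I ran into a bug. The bug has already been reported, but its status is still unresolved.

Reports with the following titles are usually closely related to this problem: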

Thrift Server - CTAS fail with Unable to move source
Replace hive.default.fileformat by spark.sql.default.fileformat
Spark sql 2.1.1 thrift server - unable to move source hdfs to target
SparkSQL cli throws exception when using with Hive 0.12 metastore in spark-1.5.0 version

2. Discovering the problem

2.1 Finding the problem

At first we logged in with beeline:

beeline -u jdbc:hive2://xxx.xxx.xxx.xxx:10000 -n user_name

Then we switched to a database:

use test;

and listed its tables:

show tables;
+-----------+-------------------------+--------------+--+
| database  |        tableName        | isTemporary  |
+-----------+-------------------------+--------------+--+
| test      | test                    | false        |
| test      | test2                   | false        |
+-----------+-------------------------+--------------+--+

Then we dropped the table:

drop table test;

and created it again:

create table test as select * from test2;

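The table was created successfully, and repeating this several times caused no problems either. At first I thought everything was fine and that we could use it with confidence. I was so happy!

Still not fully reassured, I ran a few more rounds of tests. I executed the same statement again from my Java code, and it failed. This puzzled me, because everything was apparently identical.

The error was: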

Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://ns1/tmp/hive/spark-test_hive_2017-07-12_15-38-47_540_4854595148769740436-6/-ext-10000/part-00000 to destination hdfs://ns1/user/test/hive/test/part-00000; (state=,code=0)
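
The interactive steps above can also be run non-interactively, which makes the test easier to repeat. The following is a minimal shell sketch, not the Java code used in the article (which is not shown). It assumes `beeline` is on the PATH; the JDBC URL and user name are the placeholders from the login command above, and the function names `ctas_sql` and `run_ctas_once`, as well as the `if exists` clause, are additions for illustration only.

```shell
# Placeholders copied from the article's login command; override via environment.
JDBC_URL="${JDBC_URL:-jdbc:hive2://xxx.xxx.xxx.xxx:10000}"
JDBC_USER="${JDBC_USER:-user_name}"

# Hypothetical helper: build the DROP + CTAS script.
# $1 = database, $2 = target table, $3 = source table
ctas_sql() {
  printf 'use %s; drop table if exists %s; create table %s as select * from %s;' \
    "$1" "$2" "$2" "$3"
}

# Hypothetical helper: run the script in one fresh beeline session.
# Each beeline invocation opens its own JDBC connection.
run_ctas_once() {
  beeline -u "$JDBC_URL" -n "$JDBC_USER" -e "$(ctas_sql "$1" "$2" "$3")"
}
```

Calling `run_ctas_once test test test2` would issue the same statements as the interactive session, so any "Unable to move source" error should show up in the beeline output.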

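2.2 Reproducing the problem

At first I wanted to check whether some configuration was wrong and thought I would try beeline once more, so I reconnected over JDBC with beeline.
This time I did not drop the table first and ran the CTAS directly, and it immediately reported the error above.

Then I tried sparkSQL and got an OOM error. Assuming SparkSQL had been started with unsuitable settings, I adjusted a configuration and restarted it, then connected with beeline again, dropped the table and recreated it, and it worked. After three successful trials I thought the problem was solved and was very happy once again!

But the issue was serious and I was still uneasy. It did not feel like an OOM problem, because the material I had found all pointed to a problem with the file system being closed.

That restart is what led us directly to the bug.

So I connected over JDBC once more, and this time it did fail! It turned out that only the first JDBC connection after the restart can drop and create the table repeatedly; from the second connection on, the error appears. We reproduced this sequence several times and are now quite certain of it.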
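
The pattern described here (first connection succeeds, later connections fail) can be checked with a small loop. This is only a sketch under assumptions: `beeline` is on the PATH, the URL and user are the article's placeholders, and `hit_move_bug` and `repro_sessions` are hypothetical names. It relies on each `beeline` invocation opening a new JDBC connection, and it detects the failure by searching the output for the error text rather than trusting the exit code.

```shell
# Placeholders copied from the article's login command; override via environment.
JDBC_URL="${JDBC_URL:-jdbc:hive2://xxx.xxx.xxx.xxx:10000}"
JDBC_USER="${JDBC_USER:-user_name}"
CTAS_SQL='use test; drop table if exists test; create table test as select * from test2;'

# Returns 0 if the given beeline output contains the CTAS move failure.
hit_move_bug() {
  printf '%s\n' "$1" | grep -q 'Unable to move source'
}

# Open $1 sessions one after another and report the outcome of each.
repro_sessions() {
  sessions="$1"
  i=1
  while [ "$i" -le "$sessions" ]; do
    # beeline prints errors on stderr, so capture both streams.
    out="$(beeline -u "$JDBC_URL" -n "$JDBC_USER" -e "$CTAS_SQL" 2>&1)"
    if hit_move_bug "$out"; then
      echo "session $i: Unable to move source"
    else
      echo "session $i: no move error"
    fi
    i=$((i + 1))
  done
}
```

Running `repro_sessions 3` right after restarting the Thrift server would, if the behaviour matches the description above, report no error for session 1 and the move error for the sessions after it.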

So I went back to searching online.

3. Attempted fixes

3.1 Online suggestion 1

The first reference was this one:
https://stackoverflow.com/questions/44233523/spark-sql-2-1-1-thrift-server-unable-to-move-source-hdfs-to-target

The answer reads:

Try setting hive.exec.staging-dir in your hive-site.xml like this:
<property>
  <name>hive.exec.stagingdir</name>
  <value>/tmp/hive/spark-${user.name}</value>
</property>

This worked for a customer who upgraded from 1.6.2 to 2.1.1 and who had that same problem with CTAS. On our dev cluster, doing this got us past your particular error, but we still have some HDFS permission issues we are working through.

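However, after trying it I found that this setting had no effect at all; SparkSQL simply does not read it. This answer most likely traces back to the report titled SparkSQL cli throws exception when using with Hive 0.12 metastore in spark-1.5.0 version.

On a closer look, this question contains several answers that relate directly to the suggestions that follow:

The reply to the first answer is the same as suggestion 2.
The second answer matches our conclusion exactly. I did not really understand it on first reading, because the problem was not yet clear to me.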

3.2 Online suggestion 2

Continuing the search, I found the JIRA issue with the following title:

Thrift Server - CTAS fail with Unable to move source

Its address is:
https://issues.apache.org/jira/browse/SPARK-21067

Its description reads:

Description

After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS would fail, sometimes…

Most of the time, the CTAS would work only once, after starting the thrift server. After that, dropping the table and re-issuing the same CTAS would fail with the following message (Sometime, it fails right away, sometime it work for a long period of time):

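This is exactly my problem: the same Spark version, the same symptoms, and the same error.
The first replies are the same as the suggestion above and in fact do not help, but further down there is this sentence: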

We are either looking for a fix or for a property to set hive.default.fileformat in Spark 2 to have it use parquet instead of textfile, since the issue is not present when the fileformat is set to “parquet”.

They say the Parquet format is not affected, so I set the file format to Parquet, but it still made no difference. The command I used was:

set spark.sql.default.fileformat=Parquet;

Since the issue was still in the Open state at the time of writing, the bug had evidently not been fixed yet.

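3.3 Combined approach

I also tried combining the two, changing the file format to Parquet and changing the staging directory (which was configured in hive-site.xml as well):
set hive.exec.stagingdir=/tmp/hive/spark-test;
The result was the same every time.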
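
For reference, the combined attempt can be written as a single session so that both settings are guaranteed to apply to the same connection as the CTAS. This is a sketch under assumptions: `beeline` is on the PATH, the URL and user are the article's placeholders, and `with_ctas_settings` and `run_ctas_with_settings` are hypothetical names. The two `set` statements are the ones quoted in this article; as reported above, they did not make the error go away.

```shell
# Placeholders copied from the article's login command; override via environment.
JDBC_URL="${JDBC_URL:-jdbc:hive2://xxx.xxx.xxx.xxx:10000}"
JDBC_USER="${JDBC_USER:-user_name}"

# Prefix a SQL script with the two session settings tried in the article.
# $1 = staging directory, $2 = SQL to run afterwards
with_ctas_settings() {
  printf 'set spark.sql.default.fileformat=Parquet; set hive.exec.stagingdir=%s; %s' \
    "$1" "$2"
}

# Run DROP + CTAS with both settings in one beeline (JDBC) session.
run_ctas_with_settings() {
  beeline -u "$JDBC_URL" -n "$JDBC_USER" -e "$(with_ctas_settings \
    /tmp/hive/spark-test \
    'use test; drop table if exists test; create table test as select * from test2;')"
}
```

Keeping the settings and the CTAS in one script rules out the possibility that the settings were applied in a different session from the one that ran the statement.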

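The Parquet setting is also mentioned in https://issues.apache.org/jira/browse/SPARK-16825, titled:
Replace hive.default.fileformat by spark.sql.default.fileformat
That issue points out that configuring the staging directory in hive-site.xml has no effect, because it is never read.

The problem is also discussed here: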
https://github.com/apache/spark/pull/14430

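4. Solution

On one client we rolled Spark back to 1.5.x, which solved the problem. The other client stays on Spark 2.1.1 and continues to use Spark MLlib for machine learning.

5. Conclusion

Spark 2.1.1 and later versions have a serious CTAS bug that had not been fixed at the time of writing, which makes CTAS unusable there. This is a pitfall; proceed with caution.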
