When using Spark SQL to union two Hive tables that contain map-type columns, the following error is thrown:
org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map is map<string,string>;
1. Reproducing the Scenario
1) Generate map-type columns with the str_to_map/map functions, then union the results:
select 1 id, str_to_map("k1:v1,k2:v2") map
union
select 2 id, map("k1","v1","k2","v2") map
2) Error message
org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map is map<string,string>;;
Distinct
+- Union
:- Project [1 AS id#116, str_to_map(k1:v1,k2:v2, ,, :) AS map#117]
: +- OneRowRelation
+- Project [2 AS id#118, map(k1, v1, k2, v2) AS map#119]
+- OneRowRelation
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:364)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
2. Analysis
Guided by the error message, we look at the checkAnalysis method of org.apache.spark.sql.catalyst.analysis.CheckAnalysis, specifically the handling logic around line 362 of the source (the error message should look familiar):
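For reference, the relevant case in checkAnalysis looks roughly like this (excerpted from the Spark 2.x source; exact line numbers and wording vary slightly by version):

```scala
// In CheckAnalysis.checkAnalysis: reject set operations over map-type columns,
// because map values are not comparable/orderable in Spark SQL.
case o if mapColumnInSetOperation(o).isDefined =>
  val mapCol = mapColumnInSetOperation(o).get
  failAnalysis("Cannot have map type columns in DataFrame which calls " +
    s"set operations(intersect, except, etc.), but the type of column ${mapCol.name} " +
    s"is ${mapCol.dataType.catalogString}")
```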
The key part is the pattern match on the logical plan in mapColumnInSetOperation:
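The helper itself is short (again excerpted from the Spark 2.x source, lightly commented):

```scala
// Returns the first output attribute (or dedup key) whose type contains a MapType,
// for the plan nodes affected by the check above.
private def mapColumnInSetOperation(plan: LogicalPlan): Option[Attribute] = plan match {
  case _: Intersect | _: Except | _: Distinct =>
    plan.output.find(_.dataType.existsRecursively(_.isInstanceOf[MapType]))
  case d: Deduplicate =>
    d.keys.find(_.dataType.existsRecursively(_.isInstanceOf[MapType]))
  case _ => None
}
```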
Whenever the output attributes (Attribute) of an Intersect, Except, or Distinct node in the logical plan contain a map type, or the keys of a Deduplicate node (also Attributes) contain a map-type column, this error is raised.
A plain union triggers this check because union deduplicates its result, i.e., the analyzer wraps the Union in a Distinct (visible in the logical plan printed above).
3. Solution
After checking with the backend team, the current business scenario does not actually require deduplication.
As we know, the main difference between union and union all is that the former deduplicates the result while the latter does not. So simply changing union to union all resolves the problem.
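Applied to the example from section 1, the rewritten query below avoids the Distinct node and runs without error (note that union all keeps duplicate rows, so confirm that is acceptable for your data):

```sql
select 1 id, str_to_map("k1:v1,k2:v2") map
union all
select 2 id, map("k1","v1","k2","v2") map
```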
Follow the WeChat official account 大数据学习与分享 for more technical content.