The empty-string problem when Spark reads from Elasticsearch
Problem description:
The field order_tp is stored as an empty string in ES; when read into Spark it causes various unexpected problems.
- Storage format in ES
{
"_index": "dwd_monitor_yuepengfei_test",
"_type": "doc",
"_id": "15",
"_score": 1,
"_source": {
"chnl_cd": "10",
"order_tp": ""
}
}
- The same data as printed in Spark
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
| 10| null|
+-------+--------+
- None of the following filters on order_tp return the row above
spark.sql("select * from test where order_tp = ''").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
spark.sql("select * from test where order_tp <> ''").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
spark.sql("select * from test where order_tp is null").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
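If order_tp really were an ordinary SQL NULL, the first two filters would still return nothing, because comparisons with NULL evaluate to NULL under SQL's three-valued logic, but `is null` would match. A minimal standalone sketch of that semantics, using SQLite in place of Spark SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (chnl_cd TEXT, order_tp TEXT)")
conn.execute("INSERT INTO test VALUES ('10', NULL)")  # order_tp stored as a real NULL

# Comparisons with NULL evaluate to NULL, so neither predicate matches:
eq_rows = conn.execute("SELECT * FROM test WHERE order_tp = ''").fetchall()
neq_rows = conn.execute("SELECT * FROM test WHERE order_tp <> ''").fetchall()
# IS NULL is the predicate that matches an ordinary SQL NULL:
null_rows = conn.execute("SELECT * FROM test WHERE order_tp IS NULL").fetchall()

print(eq_rows)    # []
print(neq_rows)   # []
print(null_rows)  # [('10', None)]
```

The fact that even `is null` fails against the ES-backed table shows the value is not behaving like an ordinary NULL on the filter path.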
- The order_tp field cannot be matched by any filter, yet it groups together with a literal null
spark
.sql("select 1 as chnl_cd, null as order_tp").withColumn("a", lit(-1))
.union(spark.sql("select chnl_cd,order_tp from test").withColumn("a", lit(1)))
.groupBy("order_tp")
.agg(Map("a" -> "collect_list")).show()
+--------+---------------+
|order_tp|collect_list(a)|
+--------+---------------+
| null| [-1, 1]|
+--------+---------------+
Given the ES empty string above, the value's status in Spark SQL is indeterminate: it prints as null and aggregates as null, yet no filter matches it. One plausible explanation is that the connector pushes WHERE predicates down to Elasticsearch, where the document stores "" (a value that exists and is not null), while groupBy runs inside Spark after the value has already been read back as null. In any case, avoid empty strings in ES and store null uniformly.
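If the mismatch does come from predicate pushdown, one hedged workaround is to disable the connector's pushdown so filters are evaluated by Spark on the already-read values. This is a sketch based on elasticsearch-hadoop's documented `pushdown` setting; verify the option name against your connector version:

```scala
// Sketch: read with filter pushdown disabled, so WHERE clauses run in Spark
// against the already-read (null) values instead of being translated into
// an Elasticsearch query that sees the stored "".
val df = spark.read
  .format("es")
  .option("pushdown", "false") // elasticsearch-hadoop: disable filter pushdown
  .load("dwd_monitor_yuepengfei_test/doc")

df.createOrReplaceTempView("test")
spark.sql("select * from test where order_tp is null").show()
```

Disabling pushdown trades query performance (ES can no longer pre-filter documents) for predictable Spark-side null semantics.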
Spark reading arrays from ES
Elasticsearch has no dedicated array type: a String array is stored under a mapping declared as keyword. When Spark reads such data, the mapping reports a plain String type while the documents actually contain arrays, so the read fails with a type error.
Standard workaround
option("es.read.field.as.array.include", "<array field name>")
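For example, a sketch of a full read with the array hint; the index name comes from the snippets above, while the field name "tags" is a hypothetical array field:

```scala
// Sketch: tell the connector which fields to materialize as arrays,
// since the ES mapping alone (keyword) cannot distinguish scalar from array.
val df = spark.read
  .format("es")
  .option("es.read.field.as.array.include", "tags") // "tags" is hypothetical
  .load("dwd_monitor_yuepengfei_test/doc")

df.printSchema() // "tags" should now appear as array<string>
```

Multiple array fields can be listed comma-separated in the same option value.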