Problems encountered when working with ES data from Spark

day_ue

Categories: ElasticSearch, Spark. Tags: spark, elasticsearch

Article link: https://blog.csdn.net/day_ue/article/details/120992103


The empty-string problem when Spark reads from ES

Problem description:

The order_tp field is stored in ES as an empty string. Once it is read into Spark, it behaves in several unexpected ways.

How the document is stored in ES

{
     "_index": "dwd_monitor_yuepengfei_test",
     "_type": "doc",
     "_id": "15",
     "_score": 1,
     "_source": {
       "chnl_cd": "10",
       "order_tp": ""
     }
}

How the data prints in Spark

+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
|     10|    null|
+-------+--------+

The record above cannot be selected by any filter on the order_tp field

spark.sql("select * from test where order_tp = ''").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
spark.sql("select * from test where order_tp <> ''").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
spark.sql("select * from test where order_tp is null").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
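
For comparison, here is a minimal local sketch of how Spark SQL treats a genuine null, with no ES involved. The object name `NullFilterSketch`, the helper `nullFilterCounts`, and the view name `test_local` are hypothetical and used only for illustration. It runs the same three predicates against a one-row DataFrame whose order_tp is a real Spark null.

```scala
import org.apache.spark.sql.SparkSession

object NullFilterSketch {
  // Returns the row counts for: order_tp = '', order_tp <> '', order_tp is null.
  def nullFilterCounts(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._
    // Local stand-in for the ES-backed table: order_tp is a genuine Spark null.
    Seq(("10", Option.empty[String]))
      .toDF("chnl_cd", "order_tp")
      .createOrReplaceTempView("test_local")

    def count(predicate: String): Long =
      spark.sql(s"select * from test_local where $predicate").count()

    (count("order_tp = ''"), count("order_tp <> ''"), count("order_tp is null"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-filter-sketch")
      .getOrCreate()
    try println(nullFilterCounts(spark)) // prints (0,0,1)
    finally spark.stop()
  }
}
```

With a plain null, both comparisons against '' return no rows, because comparing null with anything yields null, but `is null` does match. So the first two empty results from the ES-backed table are what a null column would produce anyway; only the empty `is null` result is specific to the ES read. One thing that may be worth checking, as an assumption not verified against this cluster, is whether the connector pushes the filter down to Elasticsearch: the elasticsearch-hadoop documentation describes a `pushdown` option for the Spark SQL data source and an `es.field.read.empty.as.null` setting.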

The order_tp field cannot be filtered on, yet it groups together with null values

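import org.apache.spark.sql.functions.lit

// Tag a literal-null row with -1 and the ES-backed row with 1, then group by order_tp
spark
  .sql("select 1 as chnl_cd, null as order_tp").withColumn("a", lit(-1))
  .union(spark.sql("select chnl_cd,order_tp from test").withColumn("a", lit(1)))
  .groupBy("order_tp")
  .agg(Map("a" -> "collect_list")).show()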

+--------+---------------+
|order_tp|collect_list(a)|
+--------+---------------+
|    null|        [-1, 1]|
+--------+---------------+

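As the results above show, it is unclear what an empty string from ES actually is once it reaches Spark SQL: it prints as null and groups with null, but no filter matches it. In practice, avoid empty strings wherever possible and store null consistently instead.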
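
One way to follow that advice is to normalize empty strings to null in Spark before the data is written to ES. The sketch below is only an illustration under that assumption: the object name `EmptyToNullSketch`, the helper `emptyToNull`, and the sample rows are hypothetical, and it uses a local DataFrame rather than a real ES index.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object EmptyToNullSketch {
  // Replace "" with null in the named string column; other values pass through unchanged.
  def emptyToNull(df: DataFrame, column: String): DataFrame =
    df.withColumn(
      column,
      when(col(column) === "", lit(null).cast("string")).otherwise(col(column))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-to-null-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows: one empty order_tp, one populated.
    val raw = Seq(("10", ""), ("11", "A")).toDF("chnl_cd", "order_tp")
    val cleaned = emptyToNull(raw, "order_tp")

    // After normalization the former empty string is an ordinary null,
    // so an "is null" filter selects it.
    cleaned.filter(col("order_tp").isNull).show()
    spark.stop()
  }
}
```

Applying the helper to every string column before the write keeps the index free of empty strings, so later reads only ever see real nulls.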

Reading arrays from ES in Spark

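ES has no dedicated array field type. A String array is stored in a field whose mapping is simply keyword. When Spark reads the index, the mapping tells it the field is a String, but the actual value is an array, so the read fails with an error.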

The usual fix

option("es.read.field.as.array.include", "<array field name>")
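
For context, a fuller read might look like the following sketch. Everything other than the `es.read.field.as.array.include` option is an assumption for illustration: the object name `ReadArraySketch`, the node address `localhost:9200`, the index name `my_index`, and the field name `tags` are all hypothetical, and the elasticsearch-spark connector must be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadArraySketch {
  // Read an index and tell the connector which field(s) hold arrays.
  def readWithArrayField(spark: SparkSession): DataFrame =
    spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost") // hypothetical ES host
      .option("es.port", "9200")       // hypothetical ES port
      // "tags" is a hypothetical keyword field that actually stores a String array
      .option("es.read.field.as.array.include", "tags")
      .load("my_index")                // hypothetical index name

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("read-array-sketch")
      .getOrCreate()
    try readWithArrayField(spark).printSchema()
    finally spark.stop()
  }
}
```

The option takes a comma-separated list, so several array fields can be declared in one value.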
