Compatibility analysis of Parquet across Spark, Impala, and Hive

袁一白

Article link: https://blog.csdn.net/qq1226317595/article/details/108404089


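Parquet is a storage format. It is independent of any language or platform and does not need to be bound to any particular data-processing framework. Still, an open-source technology needs a suitable ecosystem behind it in order to grow, and Spark is one of Parquet's core backers. As an in-memory parallel computing engine, Spark is widely used for stream processing, offline processing, and similar workloads, and it has supported Parquet since 1.0.0, which makes working with the data convenient.
Apache Arrow is a newer open-source project under the Apache Software Foundation, and a top-level project. Its goal is to serve as a cross-platform data layer that speeds up big-data analytics projects.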

Our data-mining team works in Python, so pyarrow was the natural choice for writing Parquet.
So:

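That is the basic structure, but an incompatibility has now shown up.
For one batch of data:
Impala query:
[image: Impala query result]

Hive query:
[image: Hive query result]
The Spark query also returns nothing but NULLs.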

Reference: https://zhuanlan.zhihu.com/p/113213420
Inspecting the schema of the data file shows:

{"index_columns": [],
 "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}],
 "columns": [
  {"name": "platform_type", "field_name": "platform_type", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "account_id", "field_name": "account_id", "pandas_type": "int32", "numpy_type": "int64", "metadata": null},
  {"name": "identify_id", "field_name": "identify_id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "N1", "field_name": "N1", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "N2", "field_name": "N2", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "N3", "field_name": "N3", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "N4", "field_name": "N4", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "N5", "field_name": "N5", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "ID1", "field_name": "ID1", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "ID2", "field_name": "ID2", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "ID3", "field_name": "ID3", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "ID4", "field_name": "ID4", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "ID5", "field_name": "ID5", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "IDA", "field_name": "IDA", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "NA", "field_name": "NA", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "update_time", "field_name": "update_time", "pandas_type": "int32", "numpy_type": "int32", "metadata": null}
 ],
 "creator": {"library": "pyarrow", "version": "0.17.1"},
 "pandas_version": "0.25.3"}

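The JSON above is the pandas metadata that pyarrow stores alongside the Parquet schema. As a rough sketch, the column names and types can be pulled out of it with the standard library so they can be compared against the table definition. The `SAMPLE` string is a shortened copy of the metadata above, and `file_columns` is a helper name made up for this illustration. With pyarrow installed, the full string would typically come from `pyarrow.parquet.read_schema(path).metadata[b"pandas"]`, where `path` is a placeholder for the data file.

```python
import json

# Shortened copy of the metadata shown above (three of the sixteen columns).
SAMPLE = """
{"index_columns": [],
 "columns": [
  {"name": "platform_type", "field_name": "platform_type",
   "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "N1", "field_name": "N1",
   "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
  {"name": "update_time", "field_name": "update_time",
   "pandas_type": "int32", "numpy_type": "int32", "metadata": null}
 ],
 "creator": {"library": "pyarrow", "version": "0.17.1"},
 "pandas_version": "0.25.3"}
"""


def file_columns(pandas_metadata):
    """Return (name, pandas_type) pairs in the order stored in the file."""
    meta = json.loads(pandas_metadata)
    return [(col["name"], col["pandas_type"]) for col in meta["columns"]]


# Print one line per column so it can be eyeballed against the table DDL.
for name, pandas_type in file_columns(SAMPLE):
    print(f"{name}: {pandas_type}")
```

Listing the names in file order makes it easy to spot a column whose name differs from the table's even though its position is the same.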
Then I looked at the pyarrow code that wrote the file:
[image: pyarrow write code]
whereas the correct version should be:

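This shows that Impala does not strictly read the schema. It treats the file the way it would a text file, so the data lines up as long as the column order matches.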
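The difference between the two lookup styles can be illustrated with a toy model. This is only a simplified sketch and not how any engine is implemented. It assumes that Hive and Spark match Parquet columns by name, which is consistent with the NULLs seen above, while Impala matches by position. The column names and the row are hypothetical.

```python
def resolve_by_position(table_columns, row):
    """Positional lookup: the i-th table column takes the i-th file value."""
    return {
        name: (row[i] if i < len(row) else None)
        for i, name in enumerate(table_columns)
    }


def resolve_by_name(file_columns, table_columns, row):
    """Name lookup: a table column with no same-named file column is NULL."""
    by_name = dict(zip(file_columns, row))
    return {name: by_name.get(name) for name in table_columns}


# Hypothetical example: same column order, different column names.
file_columns = ["platform", "account", "ts"]
table_columns = ["platform_type", "account_id", "update_time"]
row = ["ios", 42, 1598400000]

positional = resolve_by_position(table_columns, row)
named = resolve_by_name(file_columns, table_columns, row)

print(positional)  # every table column receives a value
print(named)       # every table column is None
```

In this model a name mismatch is invisible to the positional reader and turns every column into NULL for the name-based reader.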

Fix
Make the two consistent (the columns in the data files and the columns in the table).
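One way to enforce that consistency is to check the column names before writing and to write the file under the table's names. The following is a minimal sketch under assumptions, not the original write code: `check_alignment` and `write_aligned` are hypothetical helpers, the column names are made up, and the data is assumed to already be in the table's column order. The pyarrow calls used are `pyarrow.table`, `Table.rename_columns`, and `pyarrow.parquet.write_table`.

```python
def check_alignment(file_columns, table_columns):
    """Return (position, file_name, table_name) for each mismatching column."""
    if len(file_columns) != len(table_columns):
        raise ValueError(
            f"column count differs: {len(file_columns)} in data, "
            f"{len(table_columns)} in table"
        )
    return [
        (i, f, t)
        for i, (f, t) in enumerate(zip(file_columns, table_columns))
        if f != t
    ]


def write_aligned(data, table_columns, path):
    """Write `data` (dict: column name -> list of values) under the table's names.

    Assumes the columns in `data` are already in the table's order.
    """
    # Imported here so check_alignment can be used without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    check_alignment(list(data), table_columns)  # raises only on a count mismatch
    table = pa.table(data).rename_columns(list(table_columns))
    pq.write_table(table, path)


# Hypothetical example: the first column is named differently in the data.
mismatches = check_alignment(
    ["platform", "account_id"], ["platform_type", "account_id"]
)
print(mismatches)
```

Running the check in the write job turns a silent NULL result at query time into an explicit list of mismatched names at write time.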
