DataSource V2
Reference: "Understanding the background of Apache Spark DataSource V2 and a hands-on introduction" (Zhihu article)
https://zhuanlan.zhihu.com/p/83006243
2.3 Data source API v2
https://issues.apache.org/jira/browse/SPARK-15689
From the SPARK-15689 description: Because of the above limitations/issues, the built-in data source implementations (like parquet, json, etc.) inside Spark SQL are not using this public Data Source API. Instead, they use an internal/non-public interface.
https://issues.apache.org/jira/browse/SPARK-13664
Motivation
- Since its input arguments include DataFrame/SQLContext, the data source API's compatibility depends on the upper-level API.
- The physical storage information (e.g., partitioning and sorting) is not propagated from the data sources, and thus not used by the Spark optimizer.
- Extensibility is not good and operator push-down capabilities are limited.
- Lacking a columnar read interface for high performance.
- The write interface is too general, with no transaction support.
Spark FileFormatWriter
org.apache.spark.sql.execution.datasources.FileFormatWriter
// We should first sort by partition columns, then bucket id, and finally sorting columns.
val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
The writer's requirements on the incoming data (the required ordering above) can influence the plan: if the data's actual ordering does not already satisfy the required ordering, Spark inserts a sort before the write.
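A minimal sketch of how that required ordering is assembled, with columns modeled as plain strings rather than Catalyst expressions (the object name and example column names are assumptions for illustration):

```scala
// Sketch of FileFormatWriter's required-ordering rule: partition columns
// first, then the bucket id expression (if the table is bucketed), then
// the user's sort columns. Columns are plain strings here, not expressions.
object RequiredOrderingSketch {
  def requiredOrdering(
      partitionColumns: Seq[String],
      bucketIdExpression: Option[String],
      sortColumns: Seq[String]): Seq[String] =
    partitionColumns ++ bucketIdExpression ++ sortColumns

  def main(args: Array[String]): Unit = {
    // e.g. a table partitioned by dt, bucketed, and sorted by user_id
    println(requiredOrdering(Seq("dt"), Some("bucketId"), Seq("user_id")))
    // List(dt, bucketId, user_id)
  }
}
```

If the incoming data already satisfies this ordering, the extra sort is skipped; otherwise a sort is added before writing.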
How to solve the Spark SQL small-files problem
https://aws.amazon.com/cn/blogs/china/application-and-practice-of-spark-small-file-merging-function-on-aws-s3/
https://blog.csdn.net/zcypaicom/article/details/128250515
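One common mitigation is to repartition before writing so each output task produces one reasonably sized file. A self-contained sketch of choosing the partition count from the input size (the object name and the 128 MB target are assumptions, not Spark defaults):

```scala
// Choose a repartition count so each output file lands near a target size.
// Assumption: the input byte count is a usable proxy for the output size
// (compression and format overhead will shift the real file sizes).
object SmallFileSketch {
  def targetPartitions(totalBytes: Long, targetFileBytes: Long = 128L << 20): Int =
    math.max(1, math.ceil(totalBytes.toDouble / targetFileBytes).toInt)

  def main(args: Array[String]): Unit = {
    println(targetPartitions(1L << 30)) // 1 GB at a 128 MB target -> 8
  }
}
```

In a Spark job this count would feed `df.repartition(n)` before the write; the `spark.sql.files.maxRecordsPerFile` conf (available since Spark 2.2, if I recall the version correctly) can additionally cap records per file.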