DataSource V2
Reference: "Understanding the background of Apache Spark DataSource V2 and a hands-on introduction" (Zhihu article)
https://zhuanlan.zhihu.com/p/83006243
2.3 Data source API v2
https://issues.apache.org/jira/browse/SPARK-15689
From the SPARK-15689 description: Because of the above limitations/issues, the built-in data source implementations (like parquet, json, etc.) inside Spark SQL are not using this public Data Source API. Instead, they use an internal/non-public interface.
https://issues.apache.org/jira/browse/SPARK-13664
Motivation
- Since its input arguments include DataFrame/SQLContext, the data source API's compatibility depends on the upper-level API.
- The physical storage information (e.g., partitioning and sorting) is not propagated from the data sources, and thus not used by the Spark optimizer.
- Extensibility is not good and operator push-down capabilities are limited.
- Lacking a columnar read interface for high performance.
- The write interface is too general, with no transaction support.
Spark FileFormatWriter
org.apache.spark.sql.execution.datasources.FileFormatWriter
// We should first sort by partition columns, then bucket id, and finally sorting columns.
val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
The writer's requirements on the incoming data (the required ordering above) can influence the plan: if the data's actual ordering does not already satisfy the required ordering, Spark inserts a sort before the write.
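A minimal sketch of how that required ordering is assembled, with columns modeled as plain strings rather than Catalyst expressions (the object name and example column names are assumptions for illustration):

```scala
// Sketch of FileFormatWriter's required-ordering rule: partition columns
// first, then the bucket id expression (if the table is bucketed), then
// the user's sort columns. Columns are plain strings here, not expressions.
object RequiredOrderingSketch {
  def requiredOrdering(
      partitionColumns: Seq[String],
      bucketIdExpression: Option[String],
      sortColumns: Seq[String]): Seq[String] =
    partitionColumns ++ bucketIdExpression ++ sortColumns

  def main(args: Array[String]): Unit = {
    // e.g. a table partitioned by dt, bucketed, and sorted by user_id
    println(requiredOrdering(Seq("dt"), Some("bucketId"), Seq("user_id")))
    // List(dt, bucketId, user_id)
  }
}
```

If the incoming data already satisfies this ordering, the extra sort is skipped; otherwise a sort is added before writing.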
How to solve the Spark SQL small-files problem
https://aws.amazon.com/cn/blogs/china/application-and-practice-of-spark-small-file-merging-function-on-aws-s3/
https://blog.csdn.net/zcypaicom/article/details/128250515
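One common mitigation is to repartition before writing so each output task produces one reasonably sized file. A self-contained sketch of choosing the partition count from the input size (the object name and the 128 MB target are assumptions, not Spark defaults):

```scala
// Choose a repartition count so each output file lands near a target size.
// Assumption: the input byte count is a usable proxy for the output size
// (compression and format overhead will shift the real file sizes).
object SmallFileSketch {
  def targetPartitions(totalBytes: Long, targetFileBytes: Long = 128L << 20): Int =
    math.max(1, math.ceil(totalBytes.toDouble / targetFileBytes).toInt)

  def main(args: Array[String]): Unit = {
    println(targetPartitions(1L << 30)) // 1 GB at a 128 MB target -> 8
  }
}
```

In a Spark job this count would feed `df.repartition(n)` before the write; the `spark.sql.files.maxRecordsPerFile` conf (available since Spark 2.2, if I recall the version correctly) can additionally cap records per file.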