1. V1 Depends on SQLContext, DataFrame and RDD
In Spark 2.0, SQLContext was deprecated and replaced by SparkSession, and the DataFrame API was superseded by the Dataset API. However, Spark has not been able to update the data source API to reflect these changes.
2. Lack of Columnar Reads
The data source API reads data in row format. Even though the internal Spark engine supports a columnar data representation, it is not exposed to data sources. Yet many data sources used for analytics are columnar by nature, so there is an unnecessary translation from columnar data to rows in the connector, and back to columnar in the Spark engine.
To avoid this for internal columnar formats like Parquet, Spark uses internal APIs. That is not possible for third-party libraries, which hurts their performance.
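The wasted round trip can be sketched as follows. This is a conceptual illustration, not Spark code: all function and variable names here are hypothetical, chosen only to show the two redundant conversions a row-based API forces on a columnar source.

```python
# Conceptual sketch (not a Spark API): the round trip a V1 connector is forced
# into when both the source and the engine are columnar but the API is row-based.

def columns_to_rows(columns):
    """Connector side: a columnar source must emit rows for the row-based API."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

def rows_to_columns(rows):
    """Engine side: the engine rebuilds its internal columnar batches from rows."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}

# A columnar source (what a Parquet-like store holds natively).
source = {"id": [1, 2, 3], "score": [0.5, 0.9, 0.1]}

rows = columns_to_rows(source)        # unnecessary translation #1
engine_batch = rows_to_columns(rows)  # unnecessary translation #2

assert engine_batch == source  # the data survives, but both conversions were wasted
```

A columnar read path would hand `source` to the engine directly, skipping both conversions; that is exactly what Spark's internal APIs do for Parquet but third-party connectors cannot.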
3. Lack of Partitioning and Sorting Info
In the data source V1 API, a data source cannot pass partition information to the Spark engine. This is bad for databases like HBase or Cassandra, which are optimised for partitioned access. When Spark reads data from these sources through the V1 API, it does not try to co-locate processing with the partitions, which results in poor performance.
Spark's built-in sources overcome this limitation using internal APIs. That is why the built-in sources are much more performant than third-party ones.
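The cost of losing partition information can be sketched with a toy scheduler. This is a conceptual illustration, not Spark's scheduler: the hostnames, partition map, and functions are all hypothetical, and serve only to show that without location info some reads inevitably go over the network.

```python
# Conceptual sketch (not Spark code): why exposing partition locations matters.
# With location info, the scheduler can place each task on the host that holds
# its partition; without it (the V1 situation), placement ignores locality.

partitions = {"p0": "host-a", "p1": "host-a", "p2": "host-b"}  # partition -> host
executors = ["host-a", "host-b"]

def schedule_with_locality(partitions):
    """The source exposes locations, so every read is node-local."""
    return dict(partitions)

def schedule_without_locality(partitions, executors):
    """V1-style: no location info, so tasks are spread round-robin."""
    return {p: executors[i % len(executors)]
            for i, p in enumerate(sorted(partitions))}

def remote_reads(assignment, partitions):
    """Count tasks that must fetch their partition over the network."""
    return sum(1 for p, host in assignment.items() if partitions[p] != host)

print(remote_reads(schedule_with_locality(partitions), partitions))                # 0
print(remote_reads(schedule_without_locality(partitions, executors), partitions))  # 2
```

With locality information, all three reads are local; without it, two of three partitions land on the wrong host and must be fetched remotely, which is the performance gap the article describes.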
4. No Transaction Support in the Write Interface
The current write interface is very generic. It was built primarily to support storing data in systems like HDFS, but more sophisticated sinks like databases need more control over the data write. For example, when data has been written only partially to a database and the job aborts, the rows already written are not cleaned up. This is not an issue in HDFS, because successful writes are marked with a `_SUCCESS` file and incomplete output can be detected. Databases have no such facility, so in this scenario the database is left in an inconsistent state. Databases handle these scenarios using transactions, which the current data source API does not support.
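The difference between the two write models can be sketched as a commit/abort protocol. This is not the Spark writer API: the classes and the staging scheme are hypothetical, illustrating only why a sink needs abort-time cleanup to stay consistent.

```python
# Conceptual sketch (not the Spark API): a V1-style writer inserts rows directly,
# so an abort leaves partial data behind; a transactional writer stages rows and
# only makes them visible on commit.

class NaiveWriter:
    """V1-style: writes go straight to the table."""
    def __init__(self, table):
        self.table = table
    def write(self, row):
        self.table.append(row)
    def abort(self):
        pass  # no way to undo rows already written

class TransactionalWriter:
    """Stages rows; commit publishes them atomically, abort discards them."""
    def __init__(self, table):
        self.table, self.staged = table, []
    def write(self, row):
        self.staged.append(row)
    def commit(self):
        self.table.extend(self.staged)
        self.staged = []
    def abort(self):
        self.staged = []  # nothing was ever visible in the table

def failing_job(writer, rows, fail_after):
    """Simulate a job that aborts partway through its writes."""
    for i, row in enumerate(rows):
        if i == fail_after:
            writer.abort()
            return
        writer.write(row)

table_a, table_b = [], []
failing_job(NaiveWriter(table_a), [1, 2, 3, 4], fail_after=2)
failing_job(TransactionalWriter(table_b), [1, 2, 3, 4], fail_after=2)
print(table_a)  # [1, 2] -> inconsistent partial write left behind
print(table_b)  # []     -> consistent: abort discarded the staged rows
```

The transactional writer is what a database sink wants to be, but the V1 write interface gives it no commit/abort hooks to implement it.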
5. Limited Extensibility
The current data source API only supports filter pushdown and column pruning. But many smart sources, i.e. data sources with their own processing power, offer more capabilities than that. The current API has no good mechanism to push other Catalyst expressions down to the underlying source.
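The limitation can be sketched with a toy source that only understands simple filters. This is not Catalyst: the filter tuples and function names are hypothetical, showing only that an expression the source cannot recognise forces a full scan plus engine-side filtering.

```python
# Conceptual sketch (not Catalyst): a V1-style source that understands only a
# fixed, simple filter shape. Anything richer (e.g. an arithmetic expression)
# cannot be pushed down, so the engine must scan all rows and filter itself.

SOURCE_ROWS = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]

def source_scan(pushed_filter=None):
    """The source applies a filter only if it recognises the shape."""
    rows = SOURCE_ROWS
    if pushed_filter and pushed_filter[0] == "eq":  # only equality is supported
        _, col, val = pushed_filter
        rows = [r for r in rows if r[col] == val]
    return rows

# Pushable filter: the source itself returns just the matching row.
print(len(source_scan(("eq", "x", 2))))        # 1

# Unsupported expression x + y > 25: the source ignores it (full scan),
# and the engine has to re-filter every row afterwards.
scanned = source_scan(("expr", "x + y > 25"))
result = [r for r in scanned if r["x"] + r["y"] > 25]
print(len(scanned), len(result))               # 3 1
```

A more extensible API would let the source advertise which expressions it can evaluate, so that even the arithmetic predicate could run inside a "smart" source instead of after a full scan.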