Lesson 61: Spark SQL Data Loading and Saving Internals, In-Depth Hands-On Study Notes
Topics in this lesson:
1 Loading data with Spark SQL
2 Saving data with Spark SQL
3 Thoughts on how Spark SQL processes data
Working with Spark SQL mostly means working with DataFrames, and DataFrame provides a set of generic load and save operations.
Spark version numbers:
Major version: a branch defined mainly by API changes
Minor version: newly added features
Patch version: bug-fix releases
/**
 * Returns the dataset stored at path as a DataFrame,
 * using the default data source configured by spark.sql.sources.default.
 *
 * @group genericdata
 * @deprecated As of 1.4.0, replaced by `read().load(path)`. This will be removed in Spark 2.0.
 */
@deprecated("Use read.load(path). This will be removed in Spark 2.0.", "1.4.0")
def load(path: String): DataFrame = {
  read.load(path)
}
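The deprecation note above points callers at read.load(path). The following is a minimal sketch of that replacement call, assuming Spark 1.4 to 1.6 on the classpath, a local master, and the default value of spark.sql.sources.default (parquet). The object name DefaultLoadSketch, the helper loadDefault, and the sample rows are hypothetical and only serve the illustration.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object DefaultLoadSketch {
  // Replacement for the deprecated sqlContext.load(path):
  // the format comes from spark.sql.sources.default.
  def loadDefault(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read.load(path)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DefaultLoadSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      // Hypothetical sample data, written as parquet so the default source can read it back.
      val path = Files.createTempDirectory("default-load").resolve("people.parquet").toString
      sqlContext.createDataFrame(Seq(("Andy", 30L), ("Justin", 19L)))
        .toDF("name", "age")
        .write.parquet(path)

      val people = loadDefault(sqlContext, path)
      people.printSchema()
      people.show()
    } finally {
      sc.stop()
    }
  }
}
```

If spark.sql.sources.default has been changed in the configuration, the same call would read a different format, so the path must then point at data in that format.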
DataFrameReader:
* :: Experimental ::
* Interface used to load a [[DataFrame]] from external storage systems (e.g. file systems,
* key-value stores, etc). Use [[SQLContext.read]] to access this.
DataFrameReader has a format method:
/**
 * Specifies the input data source format.
 *
 * @since 1.4.0
 */
def format(source: String): DataFrameReader = {
  this.source = source
  this
}
When reading data, you can specify the file type to read directly, for example JSON or Parquet.
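A minimal sketch of choosing the format at read time, assuming Spark 1.4 to 1.6 on the classpath and a local master. The object name FormatSketch, the helper loadAs, and the temporary JSON file are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object FormatSketch {
  // Reads `path` with the data source named by `source`, e.g. "json" or "parquet".
  def loadAs(sqlContext: SQLContext, source: String, path: String): DataFrame =
    sqlContext.read.format(source).load(path)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FormatSketch").setMaster("local[2]"))
    try {
      val sqlContext = new SQLContext(sc)
      // Hypothetical sample data: one JSON object per line, as the JSON source expects.
      val file = Files.createTempFile("people", ".json")
      val lines = "{\"name\":\"Andy\",\"age\":30}\n{\"name\":\"Justin\",\"age\":19}\n"
      Files.write(file, lines.getBytes(StandardCharsets.UTF_8))

      val people = loadAs(sqlContext, "json", file.toString)
      people.printSchema()
      people.show()
    } finally {
      sc.stop()
    }
  }
}
```

Because format returns the reader itself, the call chains naturally into load; passing "parquet" instead of "json" would switch the data source without changing anything else in the call.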
/**
 * Specifies the input schema. Some data sources (e.g. JSON) can infer the input schema
 * automatically from data. By specifying the schema here, the underlying data source can
 * skip the schema inference step, and thus speed up data loading.
 *
 * @since 1.4.0
 */
def schema(schema: StructType): DataFrameReader = {
  this.userSpecifiedSchema = Option(schema)
  this
}
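As the comment above says, supplying a schema lets the data source skip inference. A minimal sketch of that usage follows, assuming Spark 1.4 to 1.6 on the classpath and a local master. The object name SchemaSketch, the field names, and the temporary JSON file are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object SchemaSketch {
  // Hypothetical schema for the sample data below.
  val peopleSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", LongType, nullable = true)))

  // Passes the schema up front so the JSON source does not need to infer it.
  def loadPeople(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read.format("json").schema(peopleSchema).load(path)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SchemaSketch").setMaster("local[2]"))
    try {
      val sqlContext = new SQLContext(sc)
      val file = Files.createTempFile("people", ".json")
      val lines = "{\"name\":\"Andy\",\"age\":30}\n{\"name\":\"Justin\",\"age\":19}\n"
      Files.write(file, lines.getBytes(StandardCharsets.UTF_8))

      val people = loadPeople(sqlContext, file.toString)
      people.printSchema() // columns follow peopleSchema rather than an inferred schema
      people.show()
    } finally {
      sc.stop()
    }
  }
}
```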
/**
 * Loads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by
 * a local or distributed file system).
 *
 * @since 1.4.0
 */
// TODO: Remove this one in Spark 2.0.
def load(path: String): DataFrame = {
  option("path", path).load()
}
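The excerpts above share one pattern: each setter stores a value and returns this, and load(path) simply records the path as an option before delegating to load(). The following plain-Scala toy illustrates that pattern without any Spark dependency. ToyReader and its string result are hypothetical stand-ins, not Spark's DataFrameReader.

```scala
import scala.collection.mutable

// Toy stand-in for the builder pattern seen in DataFrameReader; not Spark's implementation.
final class ToyReader(defaultSource: String) {
  private var source: String = defaultSource
  private val extraOptions = mutable.LinkedHashMap.empty[String, String]

  // Each setter mutates the reader and returns it so calls can be chained.
  def format(source: String): ToyReader = { this.source = source; this }
  def option(key: String, value: String): ToyReader = { extraOptions += (key -> value); this }

  // Mirrors the delegation above: the path is just another option.
  def load(path: String): String = option("path", path).load()

  // A real reader would resolve the data source here; the toy only reports what it collected.
  def load(): String = {
    val opts = extraOptions.map { case (k, v) => s"$k=$v" }.mkString(", ")
    s"source=$source; options=[$opts]"
  }
}

object ToyReaderDemo {
  def main(args: Array[String]): Unit = {
    // prints: source=json; options=[path=/tmp/people.json]
    println(new ToyReader("parquet").format("json").load("/tmp/people.json"))
  }
}
```

Treating the path as an ordinary option is what lets a single no-argument load() serve both path-based sources and sources that need no path.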
/**
* Loads input in as a [[DataFrame]], for data sources that don't require a path (e.g. external
* key-value stores).
*
*